Abstract

This paper studies the statistical properties of the group Lasso estimator for
high dimensional sparse quantile regression models where the number of explanatory
variables (or the number of groups of explanatory variables) is possibly much
larger than the sample size while the number of variables in "active" groups is
sufficiently small. We establish a non-asymptotic bound on the $\ell_2$-estimation
error of the estimator. This bound explains situations under which the group Lasso
estimator is potentially superior/inferior to the $\ell_1$-penalized quantile regression
estimator in terms of the estimation error. We also propose a data-dependent
choice of the tuning parameter to make the method more practical, by extending
the original proposal of Belloni and Chernozhukov (2011) for the $\ell_1$-penalized
quantile regression estimator. As an application, we analyze high dimensional
additive quantile regression models. We show that under a set of suitable
regularity conditions, the group Lasso estimator can attain a convergence rate
arbitrarily close to the oracle rate. Finally, we conduct simulation experiments
to examine our theoretical results.

AMS2010 subject classifications: 62G05, 62J99 

Key words: additive model, group Lasso, non-asymptotic bound, quantile regression

1 Introduction

During the last decade, a great deal of attention has been paid to penalization
methods for the estimation of high dimensional sparse statistical models where the
number of explanatory variables is possibly larger than the sample size while the
number of "active" variables is sufficiently small. The most popular penalization
method is arguably the $\ell_1$-penalization, known as the "Lasso" (Tibshirani, 1996)
in the linear regression case. A number of researchers have studied statistical
properties such as the $\ell_2$-estimation error and the model selection property of the
$\ell_1$-penalized estimator for high dimensional linear regression models (Bunea et al.,
2007a,b; Zhao and Yu, 2007; Zhang and Huang, 2008; Wainwright, 2009; Meinshausen
and Yu, 2009; Bickel et al., 2009), for high dimensional generalized linear models
such as logistic regression models (van de Geer, 2008; Negahban et al., 2010), and for
high dimensional quantile regression models (Belloni and Chernozhukov, 2011).
Another important penalization method is the group Lasso (Yuan and Lin, 2006),
which is intended to select groups of variables instead of selecting variables
individually. In this paper, we study the statistical properties of the group Lasso
for high dimensional sparse quantile regression models.

Since the seminal work of Koenker and Bassett (1978), quantile regression has been
one of the main topics in statistics and econometrics. An attractive feature of
quantile regression is that it allows us to make inference on the entire conditional
distribution by estimating several different conditional quantiles. We refer to
Koenker (2005) for a standard textbook on quantile regression. The recent work of
Belloni and Chernozhukov (2011) established bounds on the estimation error and the
number of selected variables of the $\ell_1$-penalized quantile regression estimator.
In particular, they established that, with a suitable choice of the tuning parameter,
the $\ell_1$-penalized quantile regression estimator attains the near oracle convergence
rate. Furthermore, they proposed a data-dependent choice of the tuning parameter to
make the method more practical. Their work is regarded as a breakthrough in the
study of penalization methods for high dimensional sparse quantile regression models.

The contributions of this paper are threefold. The first and main contribution is to
establish a non-asymptotic bound on the $\ell_2$-estimation error of the group Lasso
estimator for high dimensional sparse quantile regression models. In particular, we
derive a bound that can explain situations under which the group Lasso estimator is
potentially superior/inferior to the $\ell_1$-penalized estimator. The group Lasso
estimation that we study requires prior knowledge on the sparsity pattern of the
parameter vector, i.e., prior knowledge that the parameter vector is groupwise sparse.
Intuitively, the group Lasso should have a superior estimation performance to the
$\ell_1$-penalization when the prior knowledge is "accurate". Our result gives formal
theoretical support for this intuition. It should be noted that in contrast to Belloni
and Chernozhukov (2011), who focused on the zero bias case where the conditional
quantile function has an exact sparse representation in terms of basis functions, we
allow for the non-zero bias case where the conditional quantile function may not have
an exact sparse representation but can be reasonably well approximated by a sparse
linear combination of basis functions. The second contribution, which is less original,
is to extend the data-dependent choice of the tuning parameter, originally proposed by
Belloni and Chernozhukov (2011), to the group Lasso case. Although their original
proposal is restricted to the zero bias case, we show that the proposed data-dependent
choice is asymptotically valid even for the non-zero bias case under suitable
conditions. Third, we apply our general results to the estimation of high dimensional
sparse additive quantile regression models. We allow for the possibility that the
number of explanatory variables is much larger than the sample size but assume that
the number of active variables is fixed. The additive components are approximated by
truncated series expansions with suitable basis functions. With this approximation,
variable selection becomes the group selection of coefficients in this expansion. In
this regard, the group Lasso is suited to the estimation of high dimensional additive
models. We show that under a set of suitable regularity conditions, the group Lasso
estimator can attain a convergence rate arbitrarily close to Stone's (1982, 1985)
oracle rate $n^{-\nu/(2\nu+1)}$, where $\nu$ indicates the smoothness of the
conditional quantile function. Such a result is new in the quantile regression
literature. We also conduct simulation experiments to examine our theoretical results.
The focus of this paper is on the estimation performance of the group Lasso estimator
for quantile regression, and we do not formally discuss its model selection property.

From a technical point of view, deriving a non-asymptotic bound that can explain the
benefit of the group Lasso for the quantile regression case is a delicate issue. One
technical difficulty is that the objective function of quantile regression is
non-differentiable, which implies that some techniques used in the analysis of linear
regression models are not directly applicable to the quantile regression case.
Furthermore, a naive extension of Belloni and Chernozhukov's proof strategy leads to a
cruder bound that cannot explain the benefit of the group Lasso. To overcome this, we
make use of a Bernstein type inequality for vector-valued Rademacher processes, which
turns out to be a key technical device for our result. We also use some materials from
empirical process theory (such as Talagrand's (1996) concentration inequality) and
geometric functional analysis to control the asymptotic behavior of many (possibly)
large matrices arising from the group Lasso formulation.

There are a number of papers on the group Lasso. Bach (2008), Nardi and Rinaldo
(2008), Huang and Zhang (2010), Wei and Zhang (2010), Obozinski et al. (2010) and
Lounici et al. (2010) studied the statistical properties of the group Lasso estimator
for linear regression models. Lounici et al. (2010) listed some applications in which
the group Lasso is potentially useful. Meier et al. (2008) applied the group Lasso to
logistic regression and established the convergence rate of the estimator; however,
they did not demonstrate the benefit of the group Lasso over the $\ell_1$-penalty.
Negahban et al. (2010) established a deterministic bound on the general M-estimator
with a decomposable penalty, which includes the group Lasso penalty as a special case;
however, they focused on smooth objective functions and did not cover the quantile
regression case. They also did not demonstrate the benefit of the group Lasso over the
$\ell_1$-penalty except for the linear regression case. It should be noted that, at
least technically, mean and quantile regressions are significantly different, so that
the analysis of the group Lasso for quantile regression requires a separate treatment.

The application of the group Lasso (and its variants) to the estimation of
nonparametric additive models has recently gained a lot of attention (Ravikumar et al.,
2009; Meier et al., 2009; Huang et al., 2010; Koltchinskii and Yuan, 2010; Raskutti et
al., 2010). However, all these papers focused on smooth objective functions and did
not cover the quantile regression case. Huang et al. (2010) derived the convergence
rate of the group Lasso estimator for high dimensional sparse additive mean regression
models. Their rate is $n^{-\nu/(2\nu+1)}\sqrt{\log(d \vee n)}$ ($d$ is the number of
explanatory variables), which may be significantly slower than the optimal rate
$n^{-\nu/(2\nu+1)}$ as $d$ may be of an exponential order. Our result improves upon
their rate result in the case of quantile regression (it should be noted, however,
that the main concern of their paper is not the standard group Lasso estimator but
the adaptive group Lasso estimator).

The remainder of the paper is organized as follows. Section 2 introduces the model,
the estimation method, and the computational method of the group Lasso estimate.
Section 3 presents the main results. We establish a non-asymptotic bound on the
$\ell_2$-estimation error of the group Lasso estimator. Using this bound, we derive
asymptotic bounds on the estimation error in typical situations. We make a brief
comparison of the theoretical performance of the group Lasso and the $\ell_1$-penalized
estimators. We then propose a data-dependent choice of the tuning parameter. Section 4
contains an application of our general results to the estimation of high dimensional
sparse additive quantile regression models. Section 5 presents simulation results.

We explain the notation used in the paper. For two sequences $a = a(n)$ and
$b = b(n)$, we use the notation $a \lesssim b$ if there exists a positive constant $C$
independent of $n$ such that $a \le Cb$, $a \asymp b$ if $a \lesssim b$ and
$b \lesssim a$, and $a \lesssim_p b$ if $a \le Cb$ with probability approaching one.
For $a, b \in \mathbb{R}$, $a \wedge b := \min\{a, b\}$ and $a \vee b := \max\{a, b\}$.
Let $\mathbb{S}^{d-1}$ denote the unit sphere on $\mathbb{R}^d$ for a positive integer
$d$. Let $0_d$ and $1_d$ denote the $d$-dimensional vectors consisting of zeros and
ones only, respectively; let $I_d$ denote the $d \times d$ identity matrix. We use
$\|\cdot\|_2$ to indicate the Euclidean norm, and use $\|\cdot\|_0$ to indicate the
$\ell_0$-seminorm. For a matrix $A$, let $\|A\|$ denote the operator norm of $A$.
For a symmetric positive semidefinite matrix $A$, let $A^{1/2}$ denote the symmetric
square root matrix of $A$. We sometimes use the notation $x_{-1} = (x_2, \ldots, x_p)'$
for $x \in \mathbb{R}^p$.

2 Preliminaries 

2.1 Model and estimation method 

We consider the quantile regression model

$$ y_i = g(z_i) + u_i, \quad P(u_i \le 0 \mid z_i) = \tau, \tag{2.1} $$

where $y_i$ is a dependent variable, $z_i$ is a vector of $d$ explanatory variables and
$\tau \in (0, 1)$ is a quantile index. We assume that $\tau$ is fixed. Let
$\mathcal{Z}$ denote the support of $z_i$.

Suppose that we have a set of basis functions $\{\psi_j : j = 1, \ldots, p\}$ on
$\mathcal{Z}$ where $\psi_1(z) \equiv 1$. For $\beta = (\beta_1, \ldots, \beta_p)' \in
\mathbb{R}^p$, we have a series approximation: $g(z) \approx \sum_{j=1}^p \beta_j
\psi_j(z) =: g_\beta(z)$. Take a sparse vector $\beta \in \mathbb{R}^p$ such that its
approximation error

$$ a_\beta := \sup_{z \in \mathcal{Z}} |g(z) - g_\beta(z)| $$

is sufficiently small. In what follows, we view $\beta$ as a target value to be
estimated. If $g(z)$ can be represented as a finite linear combination of basis
functions, we can take $\beta$ as the true coefficient vector (in that case,
$a_\beta = 0$). However, it should be noted that, even in that case, $\beta$ can be
different from the true coefficient vector, $\beta^*$, say. This happens when $\beta^*$
itself is not sparse but there is a sparse vector $\beta$ such that its approximation
error is small. Another possibility is, of course, that $g(z)$ cannot be represented
as a finite linear combination of basis functions. For ease of exposition, we refer to
the case that $g = g_\beta$ as the "zero bias case" and the case that
$g \ne g_\beta$ as the "non-zero bias case". Although $\beta$ is generally not uniquely
determined, the results below hold for any $\beta$ satisfying the restrictions stated
later.

To define the group Lasso estimator, we prepare some notation. Define
$x_i := (\psi_1(z_i), \ldots, \psi_p(z_i))'$. As in the usual regression case, we also
call $x_i$ explanatory variables. Let $\{G_1, \ldots, G_q\}$ be a partition of
$\{1, \ldots, p\}$ such that $G_1 = \{1\}$, i.e., $\bigcup_{k=1}^q G_k =
\{1, \ldots, p\}$ with $G_1 = \{1\}$, $G_k \ne \emptyset$ for all $k$ and
$G_k \cap G_l = \emptyset$ for all $k \ne l$. Throughout the paper, we assume
$q \ge 2$. We view each $G_k$ as a "group" for the explanatory variables $x_i$. Let
$p_k$ denote the cardinality of $G_k$, i.e., $p_k := |G_k|$. For
$\beta \in \mathbb{R}^p$, we write $\beta_{G_k} := (\beta_j, j \in G_k)' \in
\mathbb{R}^{p_k}$ and $S(\beta) := \{1\} \cup \{k \in \{2, \ldots, q\} :
\|\beta_{G_k}\|_2 > 0\}$, and use the notation $p_S := \sum_{k \in S} p_k$ for
$S \subset \{1, \ldots, q\}$. Define $X := [x_1 \cdots x_n]'$,
$\hat\Sigma := n^{-1}X'X$, $\Sigma := E[x_1 x_1']$, $X_{G_k} := [x_{1G_k} \cdots
x_{nG_k}]'$, $\hat\Sigma_k := n^{-1}X_{G_k}'X_{G_k}$, and $\Sigma_k :=
E[x_{1G_k}x_{1G_k}']$. Working with this notation, we consider the group Lasso
estimator:

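$$ \hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left[ \frac{1}{n}\sum_{i=1}^n \rho_\tau(y_i - x_i'\beta) + \frac{\lambda}{n}\sum_{k=2}^q \sqrt{p_k}\,\|\hat\Sigma_k^{1/2}\beta_{G_k}\|_2 \right], \tag{2.2} $$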



where $\rho_\tau(u) := \{\tau - I(u \le 0)\}u$ is the check function and $\lambda$ is a
nonnegative tuning parameter. It should be noted that the constant term $\beta_1$ is
not penalized, which is standard in the literature. The existence of the group Lasso
estimate is always guaranteed, although it may not be unique. By the nature of the
group Lasso penalty, we may have $\hat\beta_{G_k} = 0$ for some $k$. This means that
the group Lasso can select groups of variables.
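The objective in (2.2) can be evaluated directly. The following is a minimal sketch in plain Python, not the author's implementation: the function names, the 0-based indexing, and the convention that `groups[0]` holds the unpenalized constant term are assumptions made for illustration. It uses the identity $\|\hat\Sigma_k^{1/2}\beta_{G_k}\|_2^2 = n^{-1}\sum_{i=1}^n (x_{iG_k}'\beta_{G_k})^2$, which follows from $\hat\Sigma_k = n^{-1}X_{G_k}'X_{G_k}$, so no matrix square root has to be computed.

```python
import math


def check_loss(u, tau):
    """Check function rho_tau(u) = {tau - I(u <= 0)} * u."""
    indicator = 1.0 if u <= 0 else 0.0
    return (tau - indicator) * u


def group_lasso_objective(beta, X, y, groups, tau, lam):
    """Objective in (2.2) for a coefficient list `beta`.

    X is a list of n rows, `groups` is a list of index lists (0-based), and
    groups[0] is assumed to hold the unpenalized constant term.
    """
    n = len(y)
    loss = 0.0
    for row, y_i in zip(X, y):
        fitted = sum(x_ij * b_j for x_ij, b_j in zip(row, beta))
        loss += check_loss(y_i - fitted, tau)
    penalty = 0.0
    for G in groups[1:]:
        # ||Sigma_hat_k^{1/2} beta_G||_2^2 = n^{-1} sum_i (x_{iG}' beta_G)^2,
        # so the penalty only needs the fitted values of each group.
        quad = sum(sum(row[j] * beta[j] for j in G) ** 2 for row in X) / n
        penalty += math.sqrt(len(G)) * math.sqrt(quad)
    return loss / n + (lam / n) * penalty


# Tiny illustrative data set: a constant column plus one group of two variables.
X = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
y = [1.0, 2.0]
groups = [[0], [1, 2]]
value = group_lasso_objective([0.0, 1.0, 1.0], X, y, groups, tau=0.5, lam=2.0)
```

In this toy example the residuals are 0 and 1, so the loss part is 0.25 and the penalty part is $\sqrt{2}$.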

We wish to establish a non-asymptotic bound on the $\ell_2$-estimation error
$\|\hat\beta - \beta\|_2$. The main assumptions we make are: (i) the number of elements
in the active groups, $p_{S(\beta)}$, is sufficiently smaller than $n$, and at the same
time (ii) the approximation error $a_\beta$ is sufficiently small, which means that the
conditional quantile function is reasonably well approximated by a function with a
groupwise sparse vector $\beta$ (more precisely, we are assuming the existence of a
vector satisfying (i) and (ii), and taking $\beta$ as such a vector).

We note that $p_{S(\beta)}$ is always larger than or equal to $\|\beta\|_0$ since
$\beta_{G_k}$ for $k \in S(\beta)$ may have zero elements. The group Lasso presumes
prior knowledge on the sparsity pattern, namely that $\beta$ is groupwise sparse. It
would make sense to say that the prior knowledge is accurate if $p_{S(\beta)}$ is close
to $\|\beta\|_0$. Intuitively, it is expected that the performance of the group Lasso
depends on the accuracy of the prior knowledge. In fact, our theoretical results
discussed below support this intuition.
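The gap between $p_{S(\beta)}$ and $\|\beta\|_0$ is easy to compute for a given grouping. The sketch below is illustrative only; the helper names and the 0-based convention (group index 0 is the constant group $G_1$, which always belongs to $S(\beta)$) are assumptions, not notation from the paper.

```python
def active_groups(beta, groups):
    """S(beta): the constant group (index 0) plus every group with a nonzero entry."""
    rest = [k for k in range(1, len(groups)) if any(beta[j] != 0 for j in groups[k])]
    return [0] + rest


def p_S(beta, groups):
    """Number of coefficients in the active groups."""
    return sum(len(groups[k]) for k in active_groups(beta, groups))


def l0(beta):
    """Number of nonzero coefficients."""
    return sum(1 for b in beta if b != 0)


groups = [[0], [1, 2, 3], [4, 5, 6]]
beta = [0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
S = active_groups(beta, groups)      # groups 0 and 1 are active
gap = p_S(beta, groups) - l0(beta)   # inactive coefficients inside active groups
```

Here the active groups contain four coefficients but only two are nonzero, so the grouping is a fairly inaccurate description of the sparsity pattern in the sense discussed above.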

A word of notation. For S C {1, . . . we use the notation := 'S'\{1} and 
S*^ := {1, . . . , q}\S. For notational convenience, let S :— S{^). 

2.2 Computation 

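This subsection is concerned with the computational aspect of the group Lasso problem
(2.2). Put $\lambda_k := \lambda\sqrt{p_k}$. We observe that the problem (2.2) is
formulated as:

$$ \begin{aligned}
\min\ & \tau\sum_{i=1}^n \eta_i^+ + (1-\tau)\sum_{i=1}^n \eta_i^- + \sum_{k=2}^q \lambda_k v_k \\
\text{s.t.}\ & \eta^+ - \eta^- = y - X\beta, \\
& \|\hat\Sigma_k^{1/2}\beta_{G_k}\|_2 \le v_k, \quad k = 2, \ldots, q, \\
& \eta^+ \ge 0,\ \eta^- \ge 0,
\end{aligned} \tag{2.3} $$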
where the inequalities are interpreted coordinatewise. A problem of this type is
called a second order cone programming (SOCP) problem (Lobo et al., 1998; Alizadeh
and Goldfarb, 2003). The dual of (2.3) reduces to

$$ \begin{aligned}
\max\ & y'a \\
\text{s.t.}\ & b_1 = X_{G_1}'a = 0, \\
& b_{G_k} = \hat\Sigma_k^{-1/2}X_{G_k}'a, \quad k = 2, \ldots, q, \\
& \|b_{G_k}\|_2 \le \lambda_k, \quad k = 2, \ldots, q, \\
& a \in [\tau - 1, \tau]^n.
\end{aligned} \tag{2.4} $$

It is clear that there exist strictly feasible solutions to the primal (2.3) and the
dual (2.4) problems. Therefore, optimal solutions to those problems exist (cf. Alizadeh
and Goldfarb, 2003, Theorem 13). In practice, we may efficiently solve the problem
(2.2) by using primal-dual interior point algorithms. For instance, in a MATLAB
implementation, we may use the SeDuMi package (Sturm, 1999) to solve SOCP problems.
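The link between (2.2) and the SOCP form (2.3) can be checked numerically without a solver. The sketch below is an illustration under stated assumptions (hypothetical function names, 0-based indices, `groups[0]` is the constant group): for a fixed $\beta$, the smallest feasible slacks are $\eta^+ = \max(y - X\beta, 0)$, $\eta^- = \max(X\beta - y, 0)$ and $v_k = \|\hat\Sigma_k^{1/2}\beta_{G_k}\|_2$, and with $\lambda_k = \lambda\sqrt{p_k}$ the value of (2.3) at these slacks equals $n$ times the objective in (2.2).

```python
import math


def residuals(beta, X, y):
    return [y_i - sum(a * b for a, b in zip(row, beta)) for row, y_i in zip(X, y)]


def group_norm(beta, X, G):
    """||Sigma_hat_k^{1/2} beta_G||_2 computed as sqrt(n^{-1} sum_i (x_{iG}' beta_G)^2)."""
    n = len(X)
    return math.sqrt(sum(sum(row[j] * beta[j] for j in G) ** 2 for row in X) / n)


def objective_22(beta, X, y, groups, tau, lam):
    """Group Lasso objective (2.2)."""
    n = len(y)
    loss = sum((tau - (1.0 if r <= 0 else 0.0)) * r for r in residuals(beta, X, y))
    penalty = sum(math.sqrt(len(G)) * group_norm(beta, X, G) for G in groups[1:])
    return loss / n + (lam / n) * penalty


def objective_23(beta, X, y, groups, tau, lam):
    """Value of (2.3) at beta with the smallest feasible slacks."""
    r = residuals(beta, X, y)
    eta_plus = [max(r_i, 0.0) for r_i in r]    # positive parts of the residuals
    eta_minus = [max(-r_i, 0.0) for r_i in r]  # negative parts of the residuals
    v = [group_norm(beta, X, G) for G in groups[1:]]
    lam_k = [lam * math.sqrt(len(G)) for G in groups[1:]]
    return (tau * sum(eta_plus) + (1.0 - tau) * sum(eta_minus)
            + sum(l * v_k for l, v_k in zip(lam_k, v)))


X = [[1.0, 0.5, -1.0], [1.0, -0.3, 0.2], [1.0, 1.2, 0.7]]
y = [0.4, -1.0, 2.0]
groups = [[0], [1, 2]]
beta = [0.1, 0.8, -0.5]
lhs = objective_23(beta, X, y, groups, tau=0.25, lam=1.5)
rhs = len(y) * objective_22(beta, X, y, groups, tau=0.25, lam=1.5)
```

Since the two objectives agree up to the factor $n$ at every $\beta$, minimizing (2.3) over $(\beta, \eta^+, \eta^-, v)$ gives the same $\beta$ as minimizing (2.2).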

3 Main results 

3.1 Conditions 

We first introduce some basic conditions. 

(C1) $\{(y_i, z_i')' : i = 1, 2, \ldots\}$ are independently and identically
distributed (i.i.d.), where the pair $(y_i, z_i')'$ satisfies the model (2.1).

(C2) Let $F(u \mid z)$ denote the conditional distribution function of $u_i$ given
$z_i = z$. Assume that $F(u \mid z)$ has a continuously differentiable density
$f(u \mid z)$ such that there exist positive constants $\varrho, c_f, C_f$ and $L_f$
such that $c_f \le f(u \mid z) \le C_f$ on $[-\varrho, \varrho] \times \mathcal{Z}$ and
$|f'(u \mid z)| \le L_f$ on the support of $(u_i, z_i')'$, where
$f'(u \mid z) := \partial f(u \mid z)/\partial u$.

(C3) $E[\|x_1\|_2^3] < \infty$. $E[x_{1G_k}x_{1G_k}'] = I_{p_k}$ for all
$k = 2, \ldots, q$.

(C4) $a_\beta \le \varrho$ and $E[f(0 \mid z_1)\{g(z_1) - g_\beta(z_1)\}] = 0$.

Condition (C1) defines the data generating process. Condition (C2) is standard in
the quantile regression literature. Condition (C3) is a moment condition on $x_i$.
Given that $\Sigma_k$ is positive definite, the normalization $\Sigma_k = I_{p_k}$ does
not lose any generality since we can always rescale the original parameter so that the
explanatory variables satisfy this normalization. In fact, given the original parameter
$\beta^0$ and the explanatory variables $x_i^0$, define the rescaled parameter
$\beta := D^{1/2}\beta^0$ and the rescaled explanatory variables
$x_i := D^{-1/2}x_i^0$ where $D := \mathrm{diag}(1, \Sigma_2, \ldots, \Sigma_q)$. The
convergence rate under the original parametrization follows from the relation
$\|\hat\beta^0 - \beta^0\|_2 \le \kappa^{-1/2}\|\hat\beta - \beta\|_2$, where $\kappa$
is the minimum eigenvalue of $D$. Condition (C4) is a technical requirement on
$\beta$. The first part of condition (C4) imposes a preliminary bound on the
approximation error $a_\beta$. Under condition (C2), this ensures that
$f(g_\beta(z) - g(z) \mid z) \ge c_f$ on the support of $z_i$. The second part of
condition (C4) does not lose any generality since we can modify $\beta$ such that it
satisfies the second part of condition (C4) without changing the sparsity pattern and
the order of the approximation error. In fact, we can show the next lemma.

Lemma 3.1. Assume conditions (C1)-(C3). For a given $\beta^0 \in \mathbb{R}^p$, there
exists a vector $\beta \in \mathbb{R}^p$ such that $E[f(0 \mid z_1)\{g(z_1) -
g_\beta(z_1)\}] = 0$, $S(\beta) = S(\beta^0)$ and $a_\beta \le 2a_{\beta^0}$.

For a proof, it suffices to see that the vector $\beta \in \mathbb{R}^p$ defined by
$\beta_1 := \beta_1^0 + E[f(0 \mid z_1)]^{-1}E[f(0 \mid z_1)\{g(z_1) -
g_{\beta^0}(z_1)\}]$ and $\beta_{-1} := \beta^0_{-1}$ satisfies the requirements of
Lemma 3.1. The second part of condition (C4) is used to separate the effect of the
trivially small group $G_1$. See the proof of Lemma A.4.
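The verification behind Lemma 3.1 can be written out in a few lines. The following is a sketch of that check rather than a quotation of the paper's proof; it only uses that the first basis function is constant, $\psi_1 \equiv 1$, that $f(0 \mid z) > 0$ on the support of $z_1$ (as provided by condition (C2)), and the shorthand $\Delta := \beta_1 - \beta_1^0$, which is introduced here for illustration.

```latex
% Shift of the constant term: Delta = beta_1 - beta_1^0 (shorthand assumed here).
\begin{align*}
g_\beta(z) &= g_{\beta^0}(z) + \Delta
  && \text{since } \psi_1 \equiv 1 \text{ and } \beta_{-1} = \beta^0_{-1}, \\
E[f(0 \mid z_1)\{g(z_1) - g_\beta(z_1)\}]
  &= E[f(0 \mid z_1)\{g(z_1) - g_{\beta^0}(z_1)\}] - \Delta\, E[f(0 \mid z_1)] = 0
  && \text{by the definition of } \beta_1, \\
|\Delta| &\le \frac{E[f(0 \mid z_1)\,|g(z_1) - g_{\beta^0}(z_1)|]}{E[f(0 \mid z_1)]}
  \le a_{\beta^0}, \\
a_\beta &= \sup_{z \in \mathcal{Z}} |g(z) - g_{\beta^0}(z) - \Delta|
  \le a_{\beta^0} + |\Delta| \le 2a_{\beta^0}.
\end{align*}
```

The sparsity pattern is unchanged because only the coefficient of the constant term moves, and the first group always belongs to $S(\cdot)$ by definition.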

Define $\gamma \in (0, 1)$ by

$$ 1 - \gamma = P\{\|\hat\Sigma_k^{1/2} - I_{p_k}\| \le 0.5 \text{ for all } 2 \le k \le q\}. $$

The constant 0.5 is not important. We implicitly assume that $\gamma$ is small, which
means that with a high probability, the matrices $\hat\Sigma_k^{1/2}$ do not deviate
too much from their population values. We will give primitive sufficient conditions to
guarantee that $\gamma \to 0$ as $n \to \infty$ ($q$ and $p_k$ may depend on $n$).

In what follows, let $c_0 > 3$ denote a fixed constant. Define

$$ C := C(c_0, S) := \Big\{\alpha \in \mathbb{R}^p : \sum_{k \in S^c}\sqrt{p_k}\,\|\alpha_{G_k}\|_2 \le c_0\sum_{k \in S}\sqrt{p_k}\,\|\alpha_{G_k}\|_2\Big\}. $$

The set $C$ is a cone, i.e., for any $\alpha \in C$ and $c > 0$, $c\alpha \in C$. It
consists of vectors $\alpha \in \mathbb{R}^p$ such that the coordinates of $\alpha$ in
the set $S$ are dominant. Such cones of dominant coordinates play an important role in
the analysis of penalization methods for high dimensional statistical models. The
present definition of $C$ comes from the fact that with a suitable choice of
$\lambda$, $\hat\beta - \beta$ concentrates on $C$ with a high probability. We
introduce a geometric condition associated with the cone $C$.

(C5) (Restricted eigenvalue condition). $\phi_{\min} := \phi_{\min}(c_0, S) :=
\inf_{\alpha \in C \cap \mathbb{S}^{p-1}}\|\Sigma^{1/2}\alpha\|_2 > 0$.

Condition (C5) is adapted from the restricted eigenvalue condition of Belloni and
Chernozhukov (2011). The restricted eigenvalue condition originates from Bickel et al.
(2009). The value of $c_0$ allowed depends on whether condition (C5) is satisfied. Note
that condition (C5) is more stringent if $c_0$ is larger. In the case that $\Sigma$ is
positive definite, $c_0$ can be any constant larger than 3. We also define
$\phi_{\max} := \phi_{\max}(c_0, S) := \sup_{\alpha \in C \cap
\mathbb{S}^{p-1}}\|\Sigma^{1/2}\alpha\|_2$.
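Membership in the cone of dominant coordinates can be tested directly. The sketch below assumes the $\sqrt{p_k}$-weighted form of the cone, i.e. $\sum_{k \in S^c}\sqrt{p_k}\|\alpha_{G_k}\|_2 \le c_0\sum_{k \in S}\sqrt{p_k}\|\alpha_{G_k}\|_2$; the function names and 0-based group indices are hypothetical and chosen for illustration only.

```python
import math


def group_l2(alpha, G):
    """Euclidean norm of the sub-vector alpha_G."""
    return math.sqrt(sum(alpha[j] ** 2 for j in G))


def in_cone(alpha, groups, S, c0):
    """True if the weighted mass outside S is at most c0 times the mass inside S."""
    outside = sum(math.sqrt(len(G)) * group_l2(alpha, G)
                  for k, G in enumerate(groups) if k not in S)
    inside = sum(math.sqrt(len(G)) * group_l2(alpha, G)
                 for k, G in enumerate(groups) if k in S)
    return outside <= c0 * inside


groups = [[0], [1, 2], [3, 4]]
S = {0, 1}                                # active groups (0-based)
dominant = [1.0, 1.0, 0.0, 0.1, 0.0]      # most mass sits on the active groups
spread = [0.0, 0.0, 0.0, 1.0, 1.0]        # all mass sits on an inactive group
scaled = [3.0 * a for a in dominant]      # positive multiples stay in the cone
```

With `c0 = 4.0`, the first vector lies in the cone, the second does not, and rescaling by a positive constant does not change membership, which is the cone property stated above.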

Following Belloni and Chernozhukov (2011), we introduce the restricted nonlin- 
ear impact (RNI) coefficient to deal with the nonlinearity of the quantile regression 
problem: define 

r :— r [Co, S) :— - — 7 mf ——7^ 

-^/0max «6cnsp-i E[|a'a;iP] 

The present definition of the RNI coefficient is a natural extension of the same concept 
defined in Belloni and Chernozhukov (2011) to the group Lasso case. Thus, their 
comment on the RNI coefficient basically applies to the present case. Note that under 
condition (C5), r* is positive. 

A motivation to introduce the RNI coefficient is concerned with the identiffcation 
power of ^. In fact, for 6 G [r*,r*) where is deffnes as := 6Cfa^/cf(f)^i^i, it is 
shown that E[p^(i/i — x[/3) — Priui — x[^)] > Cf(f)'^^J\(3 — $\\l/Q whenever /3 — /3 e C 
and 11/3 — /3||2 = 5 (see the proof of Lemma A. 2). 

3.2 Bound on the $\ell_2$-estimation error

We give a non-asymptotic bound on the $\ell_2$-estimation error
$\|\hat\beta - \beta\|_2$. We introduce

some notation used in the statement of the theorem. Let Ai, and B be any positive 
constants. Recall that ps '■= "^kesPk for C {1, . . . , g}. Define 

$$ p_{\min} := \min_{2 \le k \le q} p_k, \qquad c_1 := \frac{c_0 + 3}{c_0 - 3}, \qquad c_2 := 12\sqrt{2}(c_0 + 1), $$

Xa := (4\/2 + A + Ai) + A2^fnk^gqlp~, 



V \ Pmin / ^ 

Theorem 3.1. Work with the same notation as above. Assume conditions (C1)-(C5). 
Take A = ciXa- Then, with probability at least 1 - 26"^?/^ - IQq^-^^n'^^ - 64g^--^' - 57, 
we have 



11/3 -/3|b< 



provided that the last expression is smaller than r* . 



V— f^, (3.1) 
Remark 3.1. When we say "with probability at least $1 - \delta$" and $\delta > 1$,
this means "with probability at least zero" (in that case the statement is interpreted
as a null statement).

The last condition restricts the magnitude of $p_S$. A similar condition appears in
Belloni and Chernozhukov (2011, Theorem 2). It also restricts the magnitude of the
approximation error $a_\beta$. These restrictions will be clearer in the asymptotic
situation discussed below. The bound (3.1) is similar in flavor to the bounds derived
in Huang and Zhang (2010, Theorem 5.1), Lounici et al. (2010, Theorem 3.1) and
Negahban et al. (2010, Corollary 4). However, all these results are on Gaussian linear
regression models.

The proof of the theorem appears in Appendix A. The proof is different from
that of Huang and Zhang (2010) as they exploited a specific property of the least
squares problem. The approach taken is similar in spirit to that taken by van de
Geer (2008) and Belloni and Chernozhukov (2011). The first step of the proof is to
establish that $\hat\beta - \beta$ concentrates on the cone $C$ with a specified
probability; the second step is to relate the bound on the estimation error to the
tail behavior of some empirical process over the cone $C$; the third step is to
estimate the tail probability of the empirical process by using the symmetrization
inequality, the comparison theorem and a concentration inequality for Rademacher
processes. An important difference from Belloni and Chernozhukov (2011) is that, to
obtain a sharper bound, we use a Bernstein type inequality instead of a Hoeffding type
inequality to estimate the tail probability of Rademacher processes (these two
inequalities make no significant difference for the $\ell_1$-penalty case). Using a
Hoeffding type inequality (such as (4.12) in Ledoux and Talagrand, 1991) in the
present problem leads to a cruder bound of the form const. $\times
\sqrt{p_S\log q/n}$, which is not satisfactory for our purpose (recall that
$p_S \ge \|\beta\|_0$ and the convergence rate of the $\ell_1$-penalized estimator is
$\sqrt{\|\beta\|_0\log p/n}$). Another difference is that we allow for the possibility
that $g \ne g_\beta$. Thus, we have to take the approximation error into account. A
minor but not negligible point is that we have to take special care in the treatment
of the constant term to separate the effect of the trivially small group $G_1$.
Clearly, if $p_{\min}$ is replaced by the minimum over $1 \le k \le q$, then the bound
becomes const. $\times \sqrt{p_S\log q/n}$. Thus, to get a sharp bound, it is
essential to separate the effect of the trivially small group $G_1$.

To gain a clearer intuition on the bound, it is useful to consider the asymptotic
situation. In what follows, we assume that all parameter values are indexed by the
sample size $n$, and take the limit as $n \to \infty$. For instance, $q = q(n)$. We
additionally assume:

(C6) $1 \lesssim \phi_{\min}$ and $\phi_{\max} \lesssim 1$.

(C7) $\gamma \to 0$ as $n \to \infty$.

(C8) 4< 1 AH=A«fi + !Hiiy 

Condition (C6) is trivially satisfied if the smallest eigenvalue of $\Sigma$ is bounded
away from zero uniformly over $n$, and the largest eigenvalue of $\Sigma$ is bounded
uniformly over $n$. Condition (C7) is a high level condition. We will give primitive
sufficient conditions to ensure condition (C7). It should be noted that condition (C7)
implicitly restricts the growth rate of $q$ and $p_k$. Condition (C8) restricts the
decreasing rate of $a_\beta$. It
restricts to go to zero sufficiently fast. Condition (C8) ensures that A < 1 and 

y n V Pmin/ 

For simplicity of exposition, we here assume that the approximation error is at most 
of the same order as the estimation error. 

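Corollary 3.1. Assume conditions (C1)-(C8) where the constants $\varrho, c_f, C_f$ and
$L_f$ are independent of $n$.

(i) Take $\lambda = t\sqrt{n}(1 + \sqrt{\log q/p_{\min}})$ where $t = t(n)$ is an
arbitrary sequence such that $t \to \infty$ as $n \to \infty$. Then, as
$n \to \infty$,

$$ \|\hat\beta - \beta\|_2 \lesssim_p t\sqrt{\frac{p_S}{n}}\left(1 + \sqrt{\frac{\log q}{p_{\min}}}\right), $$

provided that the last expression is of order $o(r^*)$.

(ii) In the case that $q \to \infty$ as $n \to \infty$, take $\lambda = \sqrt{tn} +
A\sqrt{n\log q/p_{\min}}$ with a sequence $t = t(n) \to \infty$ and a constant
$A > 16\sqrt{2}$ independent of $n$. Then, we have

$$ \|\hat\beta - \beta\|_2 \lesssim_p \sqrt{\frac{t\,p_S}{n}} + \sqrt{\frac{p_S\log q}{n\,p_{\min}}}, $$

provided that the last expression is of order $o(r^*)$.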

The corollary is immediate from Theorem 3.1. The last condition restricts the
growth rate of $p_S$. In the optimal case, $1 \lesssim r^*$. For instance, this is
true when $x_{1,-1}$ has a log concave density (Belloni and Chernozhukov, 2011,
Comment 2.2). In that case, the corollary holds when

$$ \frac{p_S}{n}\left(1 + \frac{\log q}{p_{\min}}\right) \to 0. $$

In the case that $\|x_{1G_k}/\sqrt{p_k}\|_2$ is uniformly bounded, i.e.,
$\|x_{1G_k}/\sqrt{p_k}\|_2 \le K$ for some constant $K$ independent of $n$, then, for
$\alpha \in C$,

$$ |\alpha'x_1| \le K\sum_{k=1}^q \sqrt{p_k}\,\|\alpha_{G_k}\|_2 \le (1 + c_0)K\sum_{k \in S}\sqrt{p_k}\,\|\alpha_{G_k}\|_2. $$

By the Cauchy-Schwarz inequality, the last expression is at most
$(1 + c_0)K\sqrt{p_S}\,\|\alpha\|_2$. Thus, $1/\sqrt{p_S} \lesssim r^*$ under
condition (C6). In that case, the corollary holds when

$$ \frac{p_S^2}{n}\left(1 + \frac{\log q}{p_{\min}}\right) \to 0. $$



When all the groups $G_2, \ldots, G_q$ are of equal size, i.e., $p_2 = \cdots = p_q =:
m$, then

$$ \|\hat\beta - \beta\|_2 \lesssim_p t\left(\sqrt{\frac{p_S}{n}} + \sqrt{\frac{|S|\log q}{n}}\right) $$

in the case (i), and

$$ \|\hat\beta - \beta\|_2 \lesssim_p \sqrt{\frac{t\,p_S}{n}} + \sqrt{\frac{|S|\log q}{n}} $$

in the case (ii). In this case, the bound has a plausible interpretation. The first
part, $\sqrt{p_S/n}$, reflects the difficulty of estimating $p_S$ (= the number of
variables in the true active groups) parameters, while the second part,
$\sqrt{|S|\log q/n}$, reflects the difficulty of finding the active groups among the
total of $q$ groups.

The corollary (roughly) recovers the result of Belloni and Chernozhukov (2011) on
the convergence rate of the $\ell_1$-penalized quantile regression estimator. In fact,
when $p_k = 1$ for all $k$, then $q = p$ and $\hat\beta$ reduces to the
$\ell_1$-penalized quantile regression estimator. Consider the zero bias case, i.e.,
$a_\beta = 0$. In that case, the corollary shows that when $p \to \infty$, for
$\lambda = A\sqrt{n\log p}$ with a constant $A > 16\sqrt{2}$ independent of $n$,
$\|\hat\beta - \beta\|_2 \lesssim_p \sqrt{\|\beta\|_0\log p/n}$, provided that the
right side is of order $o(r^*)$ ($t$ is chosen such that $t \lesssim \log p$).

The corollary explains situations under which the group Lasso estimator is
potentially superior/inferior to the $\ell_1$-penalized estimator. If
$p_S/\|\beta\|_0 = o(\log p)$ and $p_S/p_{\min} = o(\|\beta\|_0)$, which means that the
number of "inactive" elements in the active groups is relatively small, then the group
Lasso estimator has an improved bound over the $\ell_1$-penalized estimator. This fits
the intuition that the group Lasso estimator should perform better when the prior
knowledge on the sparsity pattern is accurate. Another piece of good news for the group
Lasso is that in some cases it can attain a convergence rate arbitrarily close to the
oracle rate $\sqrt{p_S/n}$, the rate which could be achieved if the true sparsity
pattern $S$ were known (see He and Shao, 2000, for convergence rates of general
M-estimators in the presence of a diverging number of parameters). On the other hand,
if $\|\beta\|_0\log p = o(p_S)$, the group Lasso is possibly inferior. It is also
worthwhile to remark that the bound depends on the smallest group (except for the first
group). To be precise, $p_{\min}$ in the bound can be replaced by the minimum value in
a set of "large" $p_k$'s with a suitable change to the choice of $\lambda$. Suppose
that some groups, say $G_2, \ldots, G_{q_1}$, have small sizes, i.e., $p_k \lesssim 1$
uniformly over $k = 2, \ldots, q_1$. As long as $q_1 \lesssim 1$, $p_{\min}$ in the
bound can be replaced by the minimum value in $p_{q_1+1}, \ldots, p_q$. This
modification is straightforward. In the course of the proof of Theorem 3.1, we just
have to separate the effect of $G_2, \ldots, G_{q_1}$, as we do for $G_1$ in the
present proof (see the proofs of Lemmas A.3 and A.4). However, the group Lasso of the
present definition may not work well (more precisely, its performance does not exceed
that of the $\ell_1$-penalized estimator) if there are many small groups in
$G_2, \ldots, G_q$. These observations are in line with Huang and Zhang (2010).
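The comparison between the two estimators can be illustrated with a few numbers. The sketch below is only a back-of-the-envelope illustration: it assumes equal group sizes, drops all constants and the slowly diverging sequence $t$, and takes the orders $\sqrt{p_S/n} + \sqrt{|S|\log q/n}$ for the group Lasso and $\sqrt{\|\beta\|_0\log p/n}$ for the $\ell_1$-penalized estimator. The function names and the numerical design are hypothetical.

```python
import math


def group_lasso_order(p_S, n_active, q, n):
    """Order of the group Lasso bound with equal group sizes (constants dropped)."""
    return math.sqrt(p_S / n) + math.sqrt(n_active * math.log(q) / n)


def l1_order(l0, p, n):
    """Order of the l1-penalized bound (constants dropped)."""
    return math.sqrt(l0 * math.log(p) / n)


n = 1000
m = 5                       # common size of the groups G_2, ..., G_q
q = 2001                    # constant group plus 2000 groups of size m
p = 1 + (q - 1) * m         # total number of coefficients
n_active = 4                # constant group plus three active groups
p_S = 1 + 3 * m             # coefficients in the active groups

# Accurate prior knowledge: every coefficient in an active group is nonzero.
group_accurate = group_lasso_order(p_S, n_active, q, n)
l1_accurate = l1_order(p_S, p, n)

# Inaccurate prior knowledge: one nonzero coefficient per active group.
l1_inaccurate = l1_order(n_active, p, n)
```

Under this design the group Lasso order is smaller when the active groups are fully nonzero, and the $\ell_1$ order is smaller when each active group contains a single nonzero coefficient, matching the two regimes described above.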

We end this section by giving primitive sufficient conditions to ensure condition
(C7). The proofs of Lemmas 3.2 and 3.3 below appear in Appendix B. We note that giving
a sharp condition on the growth rate of $q$ and $p_k$ is not a trivial task. In fact,
sharp conditions for consistency (in the operator norm) of high dimensional sample
covariance matrices under various distributional conditions are extensively studied
in the recent geometric functional analysis literature (see Vershynin, 2011, for a
review). We first consider the case that $\|x_{1G_k}/\sqrt{p_k}\|_2$ is uniformly
bounded.

Lemma 3.2. Assume conditions (C1) and (C3). Suppose that there exists a positive
constant $K$ independent of $n$ such that $\|x_{1G_k}/\sqrt{p_k}\|_2 \le K$ almost
surely for all $2 \le k \le q$. Then, condition (C7) holds if
$p_{\max}\log(q \vee n)/n \to 0$, where $p_{\max} := \max_{2 \le k \le q} p_k$.

The proof of Lemma 3.2 relies on the combination of Talagrand's (1996) concentration
inequality for empirical processes and Rudelson's (1999) inequality for Gram
matrices. A careful examination of the proof gives an exact bound on $\gamma$. Given
that $\|x_{1G_k}/\sqrt{p_k}\|_2$ is uniformly bounded, the growth condition on $q$ and
$p_{\max}$ is considerably weak. A primitive sufficient condition for the uniform
boundedness of $\|x_{1G_k}/\sqrt{p_k}\|_2$ is that each element in $x_i$ is uniformly
bounded.

We next consider the case that $x_i$ satisfies a subgaussian condition. For a
real-valued random variable $X$, we define

$$ \|X\|_{\psi_2} := \inf\{s > 0 : E[\exp(X^2/s^2)] \le 2\}. $$

We refer to van der Vaart and Wellner (1996, Section 2.1) for the $\psi_2$-norm.
Recall that $\Sigma := E[x_1x_1']$.

(C9) The $x_i$'s are generated as $x_i = \Sigma^{1/2}\tilde{x}_i$ where the
$\tilde{x}_i$'s are i.i.d. with $E[\tilde{x}_1\tilde{x}_1'] = I_p$ and
$\sup_{\alpha \in \mathbb{S}^{p-1}}\|\alpha'\tilde{x}_1\|_{\psi_2} \le C_\psi$ for
some constant $C_\psi$ independent of $n$.

It is worthwhile to note that under condition (C6), condition (C9) ensures that
$E[|\alpha'x_1|^3] \lesssim E[(\alpha'x_1)^2]^{3/2}$, which implies that
$1 \lesssim r^*$.

Lemma 3.3. Assume conditions (C3) and (C9). Then, condition (C7) holds if
$(p_{\max} \vee \log q)/n \to 0$.

The proof of Lemma 3.3 relies on a result from geometric functional analysis
(Mendelson et al., 2007). The case that $x_{1,-1}$ has a centered normal distribution is an
example that satisfies condition (C9) but does not satisfy the condition of Lemma
3.2. It should be noted that Lemma 3.3 does not cover Lemma 3.2. Suppose, for
illustrative purposes, that $q = 2$ and $x_{1,-1}$ has a uniform distribution over the finite
set $\{ \sqrt{p-1} e_1, \dots, \sqrt{p-1} e_{p-1} \}$ where $\{ e_j \}_{j=1}^{p-1}$ is the canonical basis of $\mathbb{R}^{p-1}$. In that
case, $\mathrm{E}[x_{1,-1} x_{1,-1}'] = I_{p-1}$ and $\| x_{1,-1}/\sqrt{p-1} \|_2 = 1$ but the supremum of the $\psi_2$-norm
of $\alpha' x_{1,-1}$ over $\alpha \in \mathbb{S}^{p-2}$ diverges as $p \to \infty$.
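The divergence claimed in this example can be checked by hand. The following short computation is a sketch that fixes the direction $\alpha = e_1$ (an illustrative choice; any coordinate direction $e_j$ gives the same value), so it yields a lower bound on the supremum.

```latex
% Worked check, assuming alpha = e_1.
% e_1' x_{1,-1} equals sqrt(p-1) with probability 1/(p-1) and 0 otherwise.
\begin{align*}
\mathrm{E}\big[\exp\{ (e_1' x_{1,-1})^2/s^2 \}\big]
  = 1 + \frac{\exp\{ (p-1)/s^2 \} - 1}{p-1} \leq 2
  &\iff \exp\{ (p-1)/s^2 \} \leq p
  \iff s^2 \geq \frac{p-1}{\log p}, \\
\| e_1' x_{1,-1} \|_{\psi_2} = \sqrt{(p-1)/\log p} &\to \infty \quad (p \to \infty).
\end{align*}
```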

3.3 Data-dependent choice of the tuning parameter 

Belloni and Chernozhukov (2011) proposed a data-dependent choice of the tuning
parameter for the $\ell_1$-penalized quantile regression estimator. Their proposal can be
extended to the group Lasso case.

By the proof of Theorem 3.1, it is seen that $\lambda$ should be taken such that the
probability of the event $\{ \lambda \geq c_1 \Lambda \}$ is close to one (see Appendix A for the definition
of $\Lambda$). In fact, the constant choice $\lambda = c_1 \lambda_A$ is taken as an upper bound on the
$(1 - 2e^{-A_1^2/2} - 16 q^{1 - A_2^2/128})$-quantile of $c_1 \Lambda$ (see Lemma A.4). Thus, a suitable approximation
to a high quantile of $c_1 \Lambda$ will work in place of $\lambda = c_1 \lambda_A$. In this regard, we consider
approximating a high conditional quantile of $\Lambda$ given $z_1^n := \{ z_1, \dots, z_n \}$. Although $\Lambda$ is
unknown and the conditional distribution of $\Lambda$ given $z_1^n$ is not pivotal (i.e., it depends
on unknown parameters) in the presence of the approximation error $a_n$, as long as $a_n$ is
small, it is expected that $\Lambda$ is close to

\[ \tilde{\Lambda} := \max_{1 \leq k \leq q} \Big\| \sum_{i=1}^{n} \{ \tau - I(u_i \leq 0) \} ( \hat{\Sigma}_k^{-1/2} x_{iG_k}/\sqrt{p_k} ) \Big\|_2, \]

where $\hat{\Sigma}_k^{-1/2}$ is interpreted as the generalized inverse of $\hat{\Sigma}_k^{1/2}$ if it is singular. The
conditional distribution of $\tilde{\Lambda}$ given $z_1^n$ is pivotal since $I(u_i \leq 0)$ are i.i.d. Bernoulli
random variables with probability $\tau$ independent of $z_1^n$. Let $\tilde{\Lambda}(1 - \theta | z_1^n)$ denote the
conditional $(1 - \theta)$-quantile of $\tilde{\Lambda}$ given $z_1^n$. Since the conditional distribution of $\tilde{\Lambda}$ given
$z_1^n$ is pivotal, $\tilde{\Lambda}(1 - \theta | z_1^n)$ is computable by simulation. It thus makes sense to take
$\lambda = c_1 \tilde{\Lambda}(1 - \theta | z_1^n)$ for small $\theta \in (0, 1)$, as it is expected that
$P(\lambda \geq c_1 \Lambda) \approx P(\lambda \geq c_1 \tilde{\Lambda})$.

To summarize, we propose to implement the group Lasso as follows. 
1. Determine a small $\theta \in (0, 1)$, e.g., $\theta = 0.1$, and a constant $c > 0$.

2. Compute $\tilde{\Lambda}(1 - \theta | z_1^n)$ by simulation.

3. Compute the group Lasso estimate (2.2) for $\lambda = c \tilde{\Lambda}(1 - \theta | z_1^n)$.
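Step 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function names, the `groups` argument (a list of $n \times p_k$ design blocks, with the intercept column as the first block), the number of simulation draws and the eigenvalue tolerance are all assumptions made for the example. It draws i.i.d. Bernoulli($\tau$) indicators in place of $I(u_i \leq 0)$ and returns the empirical $(1 - \theta)$-quantile of the simulated pivotal statistic.

```python
import numpy as np


def inv_sqrt_psd(S, tol=1e-10):
    """Generalized inverse square root of a symmetric PSD matrix via its
    spectral decomposition; eigenvalues below `tol` are mapped to zero."""
    d, V = np.linalg.eigh(S)
    inv = np.zeros_like(d)
    keep = d > tol
    inv[keep] = 1.0 / np.sqrt(d[keep])
    return (V * inv) @ V.T


def simulate_lambda_quantile(groups, tau, theta, n_sim=1000, seed=0):
    """Monte Carlo approximation of the conditional (1 - theta)-quantile of
    the pivotal statistic, given the design blocks in `groups`.

    groups : list of (n, p_k) arrays; groups[0] is the intercept column.
    """
    rng = np.random.default_rng(seed)
    n = groups[0].shape[0]
    # Pre-scale each block: rows become Sigma_hat_k^{-1/2} x_{iG_k} / sqrt(p_k).
    scaled = []
    for Xk in groups:
        pk = Xk.shape[1]
        Sk = Xk.T @ Xk / n
        scaled.append(Xk @ inv_sqrt_psd(Sk) / np.sqrt(pk))
    # Each row of xi holds tau - 1{U_i <= tau}; the indicator is Bernoulli(tau).
    xi = tau - (rng.uniform(size=(n_sim, n)) <= tau).astype(float)
    norms = np.stack([np.linalg.norm(xi @ Wk, axis=1) for Wk in scaled])
    draws = norms.max(axis=0)  # maximum over groups, one value per draw
    return float(np.quantile(draws, 1.0 - theta))
```

The tuning parameter in step 3 would then be `c * simulate_lambda_quantile(groups, tau, theta)`. Precomputing the scaled blocks keeps the cost of each simulation draw to one matrix product per group.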

A recommended choice of $c$ is a value slightly larger than 1, e.g., $c = 1.1$. In the
remainder of this subsection, we discuss the theoretical validity of the proposed
data-dependent choice of the tuning parameter. We separately consider the zero bias and
non-zero bias cases.

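Zero bias case: Suppose first that $a_n = 0$. In this case, $\Lambda = \tilde{\Lambda}$. Thus, for
$\lambda = c_1 \tilde{\Lambda}(1 - \theta | z_1^n)$, we have $P(\lambda \geq c_1 \Lambda \mid z_1^n) = 1 - \theta$. Let $A_1$ and $A_2$ be any positive
constants such that $2e^{-A_1^2/2} + 16 q^{1 - A_2^2/128} \leq \theta$. Then, by Lemma A.4,
$\tilde{\Lambda}(1 - \theta | z_1^n) \leq \lambda_A = (4\sqrt{2} + A_1)\sqrt{n} + A_2 \sqrt{n \log q/p_{\min}}$. Therefore, in view of the proof of
Theorem 3.1, we obtain the next corollary.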

Corollary 3.2. Assume conditions (C1)-(C5) with $a_n = 0$. For a given $\theta \in (0, 1)$,
take $\lambda = c_1 \tilde{\Lambda}(1 - \theta | z_1^n)$. Let $A_1$, $A_2$ and $B$ be any positive constants such that
$2e^{-A_1^2/2} + 16 q^{1 - A_2^2/128} \leq \theta$. Then, with probability at least $1 - 64 q^{1 - B^2} - \theta - 5\gamma$, the
inequality (3.1) holds with $\lambda$ replaced by $c_1 \lambda_A$, provided that the upper bound is smaller
than $r^*$.

The results analogous to Corollary 3.1 hold with suitable modifications. Although
the constant choice $\lambda = c_1 \lambda_A$ is available in the zero bias case once $c_1$ is determined,
we recommend using the above data-dependent choice because the constant choice
$\lambda = c_1 \lambda_A$ is only an upper bound on the (conditional)
$(1 - 2e^{-A_1^2/2} - 16 q^{1 - A_2^2/128})$-quantile of $c_1 \Lambda$, and may be too large in practice. This point is
discussed in Belloni and Chernozhukov (2009, Comment 2.1) in a different context.

Non-zero bias case: In this case, since $a_n \neq 0$, $\Lambda$ is not equal to $\tilde{\Lambda}$. Thus, the
conclusion of Corollary 3.2 does not hold. However, as long as $a_n \to 0$ sufficiently fast,
it is expected that $\lambda \asymp \tilde{\Lambda}(1 - \theta | z_1^n)$ with a suitable sequence $\theta = \theta(n)$ gives an
asymptotically correct choice. In fact, we can show the next theorem.

Theorem 3.2. Assume condition (C1). For any sequence $t = t(n) \to \infty$ such that
$t n^{-1/2} \sqrt{1 + \log q/p_{\min}} \to 0$, take $\theta = (e \vee q^{1/p_{\min}})^{-t^2}$. Then, there exists a positive
constant $M$ such that for large $n$, with probability one,

\[ M^{-1} t \sqrt{n} (1 + \sqrt{\log q/p_{\min}}) \leq \tilde{\Lambda}(1 - \theta | z_1^n) \leq M t \sqrt{n} (1 + \sqrt{\log q/p_{\min}}). \]

Therefore, the conclusion of Corollary 3.1 (i) holds for $\lambda = c \tilde{\Lambda}(1 - \theta | z_1^n)$ where $c$ is
any positive constant.
The proof of Theorem 3.2 appears in Appendix C. Theorem 3.2 shows that the
proposed data-dependent choice (with a suitable sequence $\theta \to 0$) is asymptotically
valid even for the non-zero bias case, although in contrast to the zero bias case, a
non-asymptotic result like Corollary 3.2 does not hold.



4 Additive quantile regression model 

In this section, we focus on the nonparametric additive quantile regression model: 

\[ y_i = g(z_i) + u_i, \quad g(z) = c + \sum_{k=1}^{d} g_k(z_k), \quad P(u_i \leq 0 \mid z_i) = \tau, \tag{4.1} \]

where $c, g_1, \dots, g_d$ are unknown. Let $\mathcal{Z}_k \subset \mathbb{R}$ denote the support of $z_{ik}$ for each
$k = 1, \dots, d$. For identification, we normalize $g_1, \dots, g_d$ such that

\[ \int_{\mathcal{Z}_k} g_k(z) dz = 0, \quad k = 1, \dots, d. \]



We allow for the possibility that $d$ is much larger than $n$. We assume that some of the
functions are zero. Without loss of generality, we assume that $g_{d_1+1} = 0, \dots, g_d = 0$.
As in Huang et al. (2010), we further assume that the number of non-zero functions,
$d_1$, is fixed.

Suppose that for each $k = 1, \dots, d$ we have a set of basis functions
$\{ \psi_{kj} : j = 1, \dots, m \}$ on $\mathcal{Z}_k$ such that

\[ \int_{\mathcal{Z}_k} \psi_{kj}(z) dz = 0, \quad j = 1, \dots, m. \tag{4.2} \]

For each $k$, we have a series approximation: $g_k(z) \approx \sum_{j=1}^{m} \beta_{kj} \psi_{kj}(z)$. Define $x_{iG_1} := 1$,
$x_{iG_k} := (\psi_{k-1,1}(z_{i,k-1}), \dots, \psi_{k-1,m}(z_{i,k-1}))'$ for $k = 2, \dots, d+1$ and
$x_i := (x_{iG_1}, x_{iG_2}', \dots, x_{iG_q}')'$ with $q = d+1$. For
$\beta = (\beta_0, \beta_{11}, \dots, \beta_{1m}, \beta_{21}, \dots, \beta_{dm})' \in \mathbb{R}^{1+dm}$, we write $\beta_{G_1} = \beta_0$
and $\beta_{G_k} = (\beta_{k-1,1}, \dots, \beta_{k-1,m})'$ for $k = 2, \dots, q = d+1$. In what follows, we follow
the same notation as Section 3. As noted in the Introduction, the group Lasso is suited
to the estimation of high dimensional additive models, since dropping variable $z_{ik}$ from
the model is equivalent to setting the corresponding coefficient group $\beta_{G_{k+1}}$ to zero.

The proposed estimation method for g is as follows. 

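1. Determine a small $\theta \in (0, 1)$ and a constant $c > 0$.

2. Compute $\tilde{\Lambda}(1 - \theta | z_1^n)$ by simulation.

3. For $\lambda = c \tilde{\Lambda}(1 - \theta | z_1^n)$, compute the group Lasso estimate:

\[ \hat{\beta} := \arg\min_{\beta \in \mathbb{R}^p} \Big[ \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta) + \lambda \sum_{k=2}^{q} \sqrt{p_k} \| \hat{\Sigma}_k^{1/2} \beta_{G_k} \|_2 \Big]. \]

4. Construct $\hat{g}(z) := \hat{\beta}_0 + \sum_{k=1}^{d} \sum_{j=1}^{m} \hat{\beta}_{kj} \psi_{kj}(z_k)$.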
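To make the criterion in step 3 concrete, the following Python sketch evaluates the penalized objective at a given coefficient vector. It is an illustration under stated assumptions, not the solver used in the paper: the function names and the `groups` argument (a list of column-index arrays whose first entry is the unpenalized intercept group) are hypothetical, and the penalty norm is computed through the identity $\| \hat{\Sigma}_k^{1/2} b \|_2 = \| X_k b \|_2/\sqrt{n}$, which holds when $\hat{\Sigma}_k = X_k' X_k/n$.

```python
import numpy as np


def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1{u <= 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u <= 0).astype(float))


def group_lasso_qr_objective(beta, y, X, groups, tau, lam):
    """Check-loss sum plus the group penalty.

    groups : list of integer index arrays into the columns of X;
             groups[0] is the intercept group and is not penalized.
    """
    n = X.shape[0]
    loss = check_loss(y - X @ beta, tau).sum()
    penalty = 0.0
    for idx in groups[1:]:
        # ||Sigma_hat_k^{1/2} beta_{G_k}||_2 equals ||X_k beta_{G_k}||_2 / sqrt(n).
        fitted_k = X[:, idx] @ beta[idx]
        penalty += np.sqrt(len(idx)) * np.linalg.norm(fitted_k) / np.sqrt(n)
    return loss + lam * penalty
```

Both terms are convex in `beta`, so the objective can be handed to a generic convex solver; the second-order cone formulation mentioned in Section 5 is one such route.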

We wish to derive the convergence rate of $\hat{g}$. To this end, we introduce some
regularity conditions. Unless otherwise stated, all parameter values are indexed by $n$.
For $\nu > 0$, let $C^\nu(E)$ denote the set of all continuous functions $h$ on a bounded set
$E \subset \mathbb{R}$ such that the $\underline{\nu}$-th derivative $h^{(\underline{\nu})}$ exists and is $(\nu - \underline{\nu})$-Hölder continuous, where
$\underline{\nu}$ is the greatest integer smaller than $\nu$.

(D1) $\{ (y_i, z_i')' : i = 1, 2, \dots \}$ are i.i.d. where the pair $(y_i, z_i')'$ satisfies the model
(4.1).

(D2) The support $\mathcal{Z}$ of $z_1$ is a compact subset of $\mathbb{R}^d$.

(D3) Let $F(u|z)$ denote the conditional distribution function of $u_1$ given $z_1 = z$.
Assume that $F(u|z)$ has a continuously differentiable density $f(u|z)$ such that
there exist positive constants $\varrho$, $c_f$, $C_f$ and $L_f$ independent of $n$ such that
$c_f \leq f(u|z) \leq C_f$ on $[-\varrho, \varrho] \times \mathcal{Z}$, and $|\partial f(u|z)/\partial u| \leq L_f$ on the support of $(u_1, z_1')'$.

(D4) $g_k \in C^\nu(\mathcal{Z}_k)$ for $k = 1, \dots, d_1$, where $\nu > 1/2$ is a fixed constant.

(D5) $\max_{1 \leq k \leq d} \sup_{z \in \mathcal{Z}_k} \| (\psi_{k1}(z), \dots, \psi_{km}(z))' \|_2 = O(m^{1/2})$.

(D6) For each $k = 1, \dots, d_1$, there exists a vector $(\beta^0_{k1}, \dots, \beta^0_{km})'$ such that
$\sup_{z \in \mathcal{Z}_k} | g_k(z) - \sum_{j=1}^{m} \beta^0_{kj} \psi_{kj}(z) | = O(m^{-\nu})$ as $m \to \infty$.

(D7) The smallest eigenvalue of $\Sigma_k$ is bounded away from zero uniformly over $(n, k)$,
and the maximum eigenvalue of $\Sigma_k$ is bounded uniformly over $(n, k)$.

(D8) $1 \lesssim \phi_{\min}$ and $\phi_{\max} \lesssim 1$.

(D9) $m \asymp n^{1/(2\nu+1)}$ and $m \log d/n \to 0$ as $n \to \infty$.

Conditions (D1)-(D6) are standard in the literature. In fact, they are adapted from
Horowitz and Lee (2005). We also refer to Newey (1997) for basic materials on series
estimation. Condition (D7) is also standard. Condition (D8) is a restricted eigenvalue
condition. We will give an example in which conditions (D7) and (D8) are satisfied.
Condition (D9) determines the order of $m$. It also restricts the magnitude of $d$. We
allow for the possibility that $d$ is of order $o\{ \exp(\text{const.} \times n^{2\nu/(2\nu+1)}) \}$, which can diverge
faster than $n$.

For a function $h : \mathcal{Z} \to \mathbb{R}$, define $\| h \|_{L_2} := \mathrm{E}[h(z_1)^2]^{1/2}$.

Theorem 4.1. Assume conditions (D1)-(D9). Let $t = t(n) \to \infty$ be a sequence such
that $t^2 \{ n^{(1-2\nu)/(2\nu+1)} \vee (m \log d/n) \} \to 0$. Take $\lambda = c \tilde{\Lambda}(1 - \theta | z_1^n)$ where $\theta = (e \vee q^{1/m})^{-t^2}$
and $c$ is a positive constant. Then, we have
$\| \hat{g} - g \|_{L_2} \lesssim_p t ( n^{-\nu/(2\nu+1)} \vee \sqrt{\log d/n} )$.

Remark 4.1. If $\log d = O(m)$, the term $n^{-\nu/(2\nu+1)}$ is dominating, while if $m = o(\log d)$
(i.e., $d$ grows faster than $\exp(\text{const.} \times n^{1/(2\nu+1)})$), the term $\sqrt{\log d/n}$ is dominating.

The proof of Theorem 4.1 appears in Appendix D. The proof is basically a verification
of the conditions of Corollary 3.1. Theorem 4.1 shows that $\hat{g}$ can attain the
convergence rate arbitrarily close to Stone's (1982, 1985) oracle rate $n^{-\nu/(2\nu+1)}$ in the
case that $\log d = O(m)$ (which still allows for the possibility that $d$ has an exponential
order in $n$). On the other hand, Huang et al. (2010, Corollary 3.1) showed
that for additive mean regression models, under a set of similar conditions, the group
Lasso estimator has the convergence rate $n^{-\nu/(2\nu+1)} \sqrt{\log(d \vee n)}$. The latter rate is
significantly slower than $n^{-\nu/(2\nu+1)}$ if $d$ has an exponential order. For instance, if
$d = \exp(\text{const.} \times n^{1/(2\nu+1)})$, $n^{-\nu/(2\nu+1)} \sqrt{\log(d \vee n)} \asymp n^{-(\nu-1/2)/(2\nu+1)}$. Although
Theorem 4.1 focuses on the quantile regression case, a similar result would apply to the
mean regression case.

One might wonder whether Corollary 3.1 implies that the $\ell_1$-penalized quantile regression
estimator (with a suitable choice of the tuning parameter) has a convergence rate
like $n^{-\nu/(2\nu+1)} \sqrt{\log(d \vee n)}$. However, that is not the case. The problem arises when
verifying condition (C8), in particular, that $\Delta \lesssim 1$. For the $\ell_1$-penalty case, since
$p_k = 1$ for all $k$, to make $\Delta \lesssim 1$, we need to assume that $a_n = O(n^{-1/2})$, which requires
undersmoothing. Thus, at this moment, it is safe to say that it is not known
whether the $\ell_1$-penalized estimator has a convergence rate like $n^{-\nu/(2\nu+1)} \sqrt{\log(d \vee n)}$
for nonparametric additive quantile regression models.

We give an example in which conditions (D7) and (D8) are satisfied. Suppose
that $z_1$ is uniformly distributed on $[0, 1]^d$ and the basis functions are common for all
$k = 1, \dots, d$, i.e., $\psi_{kj} = \psi_j$ for all $k = 1, \dots, d$ and $j = 1, \dots, m$. In that case, $\mathcal{Z}_k = [0, 1]$
for all $k = 1, \dots, d$ and $\mathcal{Z} = [0, 1]^d$. Define the $m \times m$ matrix $\Psi$ by

\[ \Psi := \Big( \int_0^1 \psi_i(z) \psi_j(z) dz \Big)_{1 \leq i, j \leq m}. \]



Lemma 4.1. Suppose that $z_1$ is uniformly distributed on $[0, 1]^d$ and the basis functions
are common for all $k = 1, \dots, d$. If the smallest eigenvalue of $\Psi$ is bounded away from
zero uniformly over $n$, and the maximum eigenvalue of $\Psi$ is bounded uniformly over
$n$, then the same holds for $\Sigma$. In particular, conditions (D7) and (D8) are satisfied.

The proof of Lemma 4.1 appears in Appendix D. If the basis functions are orthonormal
with respect to the Lebesgue measure (such as Fourier series), $\Psi$ is the identity
matrix, and the condition of Lemma 4.1 is trivially satisfied. Suitably normalized
polynomial splines also satisfy the condition of Lemma 4.1 (see the proof of Newey, 1997,
Theorem 7).
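The orthonormal case can be illustrated numerically. The sketch below uses the cosine system $\psi_j(z) = \sqrt{2} \cos(j \pi z)$ on $[0, 1]$, which is an assumed example basis (the paper does not specify one here); each function integrates to zero as required by (4.2), and the Gram matrix $\Psi$ is approximated with a midpoint rule. The function names and grid size are hypothetical.

```python
import numpy as np


def cosine_basis(z, m):
    """Evaluate psi_j(z) = sqrt(2) * cos(j * pi * z), j = 1..m, at the points z.
    Returns an array of shape (len(z), m)."""
    z = np.asarray(z, dtype=float)
    j = np.arange(1, m + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(z, j))


def gram_matrix(m, n_grid=2000):
    """Midpoint-rule approximation of Psi = (int_0^1 psi_i(z) psi_j(z) dz)."""
    z = (np.arange(n_grid) + 0.5) / n_grid
    B = cosine_basis(z, m)
    return B.T @ B / n_grid


Psi = gram_matrix(6)
eigenvalues = np.linalg.eigvalsh(Psi)  # all close to one for this basis
```

For this basis the eigenvalues of `Psi` stay at one for every `m`, which is the situation in which the condition of Lemma 4.1 holds trivially.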
5 Simulation experiments



In this section, we report simulation results to illustrate the practical performance of
the group Lasso estimator. We compare the group Lasso estimator with the
$\ell_1$-penalized quantile regression estimator and the unpenalized quantile regression
estimator. In all cases, we used the SeDuMi package in MATLAB to compute
these estimates, and ran 1,000 repetitions. Note that since the SeDuMi package is
based on an interior point algorithm, the number of non-zero elements in the
unpenalized quantile regression estimator may be larger than the sample size. Although
we have not formally discussed the model selection property of the group Lasso, for
reference, we report the number of selected variables in the simulation experiments.

We first consider the zero bias case. Let $\Phi(\cdot)$ denote the distribution function of
the standard normal distribution.

Model 1: $y_i = x_i'\beta + \epsilon_i$, $x_{i1} = 1$, $x_{i,-1} \sim N(0, \{ (0.25)^{|j-k|} \}_{j,k})$,
$\epsilon_i + \Phi^{-1}(\tau) \sim N(0, 1)$, $x_i \perp \epsilon_i$.

We took $p = 501$, $q = 101$, $p_2 = \cdots = p_{101} = 5$, $n = 200$ and $\tau \in \{ 0.25, 0.5, 0.75 \}$. For
the group Lasso and the $\ell_1$-penalized quantile regression estimators, we followed the
data-dependent choice of the tuning parameter discussed in Section 3.3. In each case,
we took $\theta = 0.1$ and $c = 1.1$. For the vector $\beta$, we consider two cases.

(Case 1) $\beta = (1, \underbrace{1, \dots, 1}_{5}, 0, \dots, 0)'$.

(Case 2) $\beta = (1, \underbrace{\underbrace{1, 0, \dots, 0}_{5}, \dots, \underbrace{1, 0, \dots, 0}_{5}}_{5 \text{ groups}}, 0, \dots, 0)'$.

By the discussion following Corollary 3.1, Case 1 is favorable to the group Lasso since
the ratio $p_S/\|\beta\|_0 = 6/6 = 1$ is small. On the other hand, Case 2 is not favorable to
the group Lasso since the ratio $p_S/\|\beta\|_0 = 26/6 \approx 4.3$ is relatively large. In fact, Cases
1 and 2 are the best and worst case scenarios for the group Lasso, respectively, since
in Case 1, all elements in the active group $G_2$ are non-zero, while in Case 2, only a
single element in each active group is non-zero.
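The two designs can be reproduced with a short data generator. The Python sketch below is an illustration, not the authors' MATLAB code: the function names are hypothetical, the non-zero entry in each active group of Case 2 is placed first within the group (an assumption), and the errors are taken to be standard normal draws shifted by $\Phi^{-1}(\tau)$ so that $P(\epsilon_i \leq 0) = \tau$, which is what the zero bias case requires.

```python
from statistics import NormalDist

import numpy as np


def model1_beta(case, n_groups=100, group_size=5):
    """Coefficient vector: intercept 1, then n_groups blocks of group_size."""
    beta = np.zeros(1 + n_groups * group_size)
    beta[0] = 1.0
    if case == 1:
        beta[1:1 + group_size] = 1.0       # one fully active group
    elif case == 2:
        for g in range(5):                 # five groups, one non-zero entry each
            beta[1 + g * group_size] = 1.0
    else:
        raise ValueError("case must be 1 or 2")
    return beta


def active_size(beta, group_size=5):
    """p_S: the intercept plus the sizes of all groups with a non-zero entry."""
    blocks = beta[1:].reshape(-1, group_size)
    return 1 + group_size * int(np.any(blocks != 0, axis=1).sum())


def simulate_model1(n, tau, case, seed=0, n_groups=100, group_size=5):
    """Draw (y, X, beta) with covariate correlations 0.25^{|j-k|}."""
    rng = np.random.default_rng(seed)
    p1 = n_groups * group_size
    idx = np.arange(p1)
    cov = 0.25 ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(cov)
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p1)) @ L.T])
    # Shift N(0,1) errors so that their tau-quantile is zero (assumption).
    eps = rng.standard_normal(n) - NormalDist().inv_cdf(tau)
    beta = model1_beta(case, n_groups, group_size)
    return X @ beta + eps, X, beta
```

With the default sizes, `active_size(model1_beta(1))` and `active_size(model1_beta(2))` give the values 6 and 26 that enter the ratios quoted above, while both vectors have 6 non-zero entries.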

Table 1 shows the simulation results for Model 1. In Case 1, as expected, the group
Lasso estimator performs the best. Its RMSE is nearly half of that of the $\ell_1$-penalized
quantile regression estimator, and nearly one fourth of that of the unpenalized
quantile regression estimator. However, in Case 2, the performance of the group Lasso
estimator is inferior to that of the $\ell_1$-penalized quantile regression estimator, and,
somewhat surprisingly, inferior even to that of the unpenalized quantile regression
estimator for $\tau \in \{ 0.25, 0.75 \}$. In Case 2, the group Lasso estimator selects redundant
models. For instance, for $\tau = 0.5$, the group Lasso selects on average about 20 variables,
while the true number of non-zero coefficients is 6. This fact possibly worsens the
performance of the group Lasso estimator. These observations are consistent with our
theoretical results.

Table 1 is about here. 
We next consider nonparametric additive models. 

Model 2: $y_i = 0.1 + \sum_{k=1}^{100} g_k(z_{ik}) + 0.5 \sigma(z_i) \epsilon_i$,

gi{z) = z, 5(2(2;) = cos(7r2;), g^iz) = e{e'' - e + e^^), 5(4 = 0, ... , 5-100 = 0, 
a^{z) = 0.7 + O.lzl + O.lzj + O.lzl Zi ~ Unif[-1, 1]^°°, 
e._$-i(r)^iV(0,l), Zi±ei. 

This model incorporates conditional heteroscedasticity. We took $n \in \{ 400, 800 \}$
and $\tau \in \{ 0.25, 0.5, 0.75 \}$, and followed the data-dependent choice of the tuning
parameter discussed in Section 3.3 with $\theta = 0.2$ and $c = 1$. We used cubic splines with 4
equidistant knots, so $m = 7$ and $p = 1 + dm = 701$ in this case.

Table 2 shows the simulation results for Model 2. The group Lasso estimator
performs significantly better than the other estimators, especially when $n = 800$. The
improvement in RMSE of the group Lasso estimator over that of the $\ell_1$-penalized
quantile regression estimator is about 30% to 40% when $n = 800$. An interesting
point is that the RMSE of the unpenalized quantile regression estimator becomes worse
when $n$ increases from 400 to 800. This is possibly because the estimates of zero
coefficients become larger when $n$ increases, and $n = 800$ is not enough for estimating
the full 701 coefficients.

Table 2 is about here. 

Finally, we note that in most cases, the number of groups selected by
the group Lasso does not deviate much from the truth. This fact indicates the possibility
that in the quantile regression case, under suitable conditions, the number of groups
selected by the group Lasso is of the same stochastic order as the truth. Similar
results are known in the linear regression case (Lounici et al., 2010) and the $\ell_1$-penalized
quantile regression case (Belloni and Chernozhukov, 2011). The argument of Belloni and
Chernozhukov (2011) depends on a specific property of the linear programming
formulation of the $\ell_1$-penalized quantile regression problem, and is not directly applicable to
the group Lasso case. The model selection property of the group Lasso in the quantile
regression case is left for future research.
A Proof of Theorem 3.1
A.1 Preliminaries

We introduce some technical results used in the proofs. The next theorem, which is 
due to Ledoux and Talagrand (1991, Theorem 4.7), is a Bernstein type inequality for 
vector valued Rademacher processes. 

Theorem A.1 (A concentration inequality for Rademacher processes). Let $\epsilon_1, \dots, \epsilon_n$
be independent Rademacher random variables. Let $B$ be a Banach space with norm
$\| \cdot \|$ such that for some countable subset $D$ in the unit ball of $B'$ (the dual of $B$),
$\| x \| = \sup_{f \in D} |f(x)|$ for all $x \in B$. Let $x_1, \dots, x_n$ be arbitrary points in $B$. Put
$Z := \| \sum_{i=1}^{n} \epsilon_i x_i \|$. Then, for every $t > 0$, we have

\[ P\{ Z \geq M(Z) + t \} \leq 2 \exp\{ -t^2/(8\sigma^2) \}, \]

where $M(Z)$ is the median of $Z$ and $\sigma := \sup_{f \in D} \sqrt{\sum_{i=1}^{n} f^2(x_i)}$. If the one-sided
inequality "$Z \geq M(Z) + t$" is replaced by the two-sided inequality "$|Z - M(Z)| \geq t$",
then the constant 2 in front of the exponential term is replaced by 4.

An immediate corollary of Theorem A.1 is:

Corollary A.1. Work with the same notation as Theorem A.1. Then, for every $\lambda > 0$,

\[ \mathrm{E}[e^{\lambda Z}] \leq 16 \exp\{ \lambda \sqrt{2 \mathrm{E}[Z^2]} + 4(\lambda\sigma)^2 \}. \]

Proof. Put $Z' := |Z - M(Z)|$. By Theorem A.1, $P(Z' \geq t) \leq 4 \exp\{ -t^2/(8\sigma^2) \}$ for all
$t > 0$. By using the formula for the expectation of positive random variables,

\[ \mathrm{E}[e^{\lambda Z'}] = 1 + \lambda \int_0^\infty e^{\lambda t} P(Z' > t) dt \leq 1 + 4\lambda \int_0^\infty e^{\lambda t - t^2/(8\sigma^2)} dt \leq 1 + 8\sqrt{2\pi} (\lambda\sigma) e^{2(\lambda\sigma)^2}. \]
Since e^'^ > 1 + > x, the right side is bounded by (1 + 8y^)e^*^''*°"'*^ < IGe'^^^"''^ . Thus, 
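we have $\mathrm{E}[e^{\lambda Z}] \leq e^{\lambda M(Z)} \mathrm{E}[e^{\lambda |Z - M(Z)|}] \leq 16 \exp\{ \lambda M(Z) + 4(\lambda\sigma)^2 \}$. The desired result
now follows from the fact that $M(Z) \leq \sqrt{2 \mathrm{E}[Z^2]}$. □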

A.2 Proof of Theorem 3.1

In this subsection, we provide a proof of Theorem 3.1. The proof consists of series of 
lemmas. We first prepare some notation. Put Wi := (t/j, £c-)' and mp{w) := p^{y — 
x'l3) -pr{y-x'$). Define M(/3) := E[m^{wi)]. For a function / : R such that 

E[|/(t«i)|] < oo, we use the notation G„/ := n'^/^ EILi{/(w^i) - E[/(w?i)]}. Define 

fio := - IpA < 0.5,2 < VA; < g}, 

n 

l<k<q ' 
1=1 

B5:={/3eMf :/3-^eC, ||/3 - = 5}, for 5 > 0, 

where Sj^^^^ is interpreted as the generahzed inverse of S^^^ if it is singular (i.e., if 
Sfe = UDU' denotes the spectral decomposition of where U is a pkXpk orthogonal 
matrix and D is a diagonal matrix with diagonal entries di > ■ ■ ■ > di > = = 
• • • = dp^^, then S^^^^ = L/diagjrf^^^^, . . . , d^^^"^, 0, . . . , 0}f/'). Recall that on ^Iq, S^. 
are nonsingular for all k. 

The first step of the proof is to establish that /3 — /3 G C on the event (A > CiA}nr2o. 
We estimate the probability of the event {A > ciA} in Lemma A. 4. 

Lemma A.l. {A > ciA} n Qq C {/3 - /3 e C}. 

Proof. By convexity of the check function, we have 

Pr{y^ - x'fi) - pM - > -{r - I{y^ < x'M<0 - 
which implies that on $\Omega_0$,

n q 

i=\ k=2 

n q 



=1 



k=2 



fe=i 



-A 5] V^lisf (/3-/9)Gj|2 + A5^V^||sf/3Gj|2 

fcG5-i fees': 

> -(A + A) J] V^lisf - $)g,\\2 + (A - A) ^ ;9gJ|2. 



fees 



fee5<= 



Thus, we have 

(A - A) 5] v^lisf /3gJ|2 < (A + A) 5] v^lisf (/3 - Ih- 

fees= fees 

On the event {A > ciA} fl Qq, this inequahty imphes that /3 — y9 e C. 



□ 



The next lemma relates the bound on ||/3 — /3||2 to the tail behavior of over 
C. Recall the definitions of and r*. 

Lemma A. 2. For any 5 e [r*,r*), we have 

{0-m2>S}n{X>c,A}nQo 

C |\/n sup |G„m^| > 5 Qc/0^i^n5 - 1.5A^/p^^ | . 

Proof. By convexity of the objective function, the fact that C is a cone and the previous 
lemma, on the event {||/3 — ^\\2 > S} H {X > CiA} n fig, there exists a vector (3 ^ Bs 
such that 

n q 

> E^M^O + aE v^dl^f - \\^fM2) 



i=l 



k=2 



Gfe 2J 



j=l 

n 



feGS- 



>^m^(z,)-A v^lisf (/3-^)gJ|2. 

«=1 feG5_i 

For the first term on (A.l), we have 

n 



(A.l) 



1=1 



> nM{f3) — sup \\/n£!nmp\. 



23 



Put a. :— 13 — 13. By Taylor's theorem, for 5 e [r*, r*), we have 



M{(3) > -Cfa^E[\a'x,\] + ^CfE[{a'x,f] - hfE[\cx'x,f] 



|3l 



Lf E\\ct'xi 



6 

> IcfElia'x.Y] 



■ Hia'x,) 



1 



where for the first inequality, it is useful to invoke Knight's (1998) identity:

\[ \rho_\tau(u - v) - \rho_\tau(u) = -\{ \tau - I(u \leq 0) \} v + \int_0^v \{ I(u \leq s) - I(u \leq 0) \} ds. \]
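Knight's identity is easy to verify numerically. The Python sketch below is an illustration with hypothetical function names; it evaluates the integral term in closed form (an elementary case analysis on the signs of $u$ and $v$, not taken from the paper) and compares both sides of the identity.

```python
import numpy as np


def check_loss(u, tau):
    """rho_tau(u) = u * (tau - 1{u <= 0}) for a scalar u."""
    return u * (tau - float(u <= 0))


def knight_rhs(u, v, tau):
    """Right-hand side of Knight's identity for scalars u and v."""
    # Closed form of int_0^v {1(u <= s) - 1(u <= 0)} ds:
    #   v >= 0: the integrand is 1 on [u, v] when u > 0, and 0 otherwise;
    #   v <  0: the integral equals the length of [v, u) when u <= 0, else 0.
    if v >= 0:
        integral = max(v - u, 0.0) if u > 0 else 0.0
    else:
        integral = max(u - v, 0.0) if u <= 0 else 0.0
    return -v * (tau - float(u <= 0)) + integral


def knight_lhs(u, v, tau):
    """Left-hand side: rho_tau(u - v) - rho_tau(u)."""
    return check_loss(u - v, tau) - check_loss(u, tau)
```

Since the integral term is non-negative for every `u` and `v`, dropping it gives the subgradient-type lower bound used at the start of the proof of Lemma A.1.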

For the second term on (A.l), on the event flo, we have 

A J2 VP^Ilsf (/3-^)gJ|2<1.5A J2 v^||(/3-y9)Gj|2<1.5av^. 
jfce5-i fceS-i 

Therefore, we obtain the desired conclusion. 

We now analyze the tail probabihty of ^/n\Gnm^\ over (3 ^Bj. 
Lemma A. 3. For any 5 > and t > ^max^'s/Sn, we have 



□ 



P < ^/n sup |G„m;3| > t> 

< 64gexp [- {t/S - C2(l + y/p§Z)V^V /{^(^2\/i^+Ps-JPr.udnV] +47, 
where we recall that C2 '■— 12-\/2(co + 1). 

Proof. Let $\epsilon_1, \dots, \epsilon_n$ be independent Rademacher random variables independent of
$w_1, \dots, w_n$. Write $P_\epsilon$ and $\mathrm{E}_\epsilon$ for the conditional probability and the conditional
expectation with respect to $\epsilon_1, \dots, \epsilon_n$ given $w_1, \dots, w_n$, respectively. For $\beta \in B_\delta$, we
have

E[m|s{w^f] < E[{x[{^ - < ct>l^8\ 

Thus, by the symmetrization inequality (van der Vaart and Wellner, 1996, Lemma 
2.3.7), for t > (pmax^V^, we have 



P <^ sup iGnTUfs] > n < 4E 



Pe { sup 



i=l 
By Markov's inequality, the probability on the right side is bounded by



exp <^ s sup 



i=l 



, for s > 0. 



(A.2) 



Fix ui := yi - x[f3,.. . := ?/„ - cc'„/3. Define (pi{-) := - ■) - Pr{ui) and 

hjji^x) := x'(3. Then, mi^^Wi) = Lpi{hf^_^{xi)) and each (pi is a contraction, i.e., 
|(^i(a) —ipi{h)\ < \a — b\ for a, 6 G M. Thus, by the comparison theorem for Rademacher 
processes (Ledoux and Talagrand, 1991, Theorem 4.12), we have 



exp <^ s sup 



exp <^ 2s sup 



13- ^{'^i 



1=1 

Put Zk-.^W Y17=i ^i^iGk/VPkh- For ^ e Bs,we have 

q f n 

k=l I i=l J 

< E V^ll - hZk + (max Z,) E VP^II (/3 - II2 

< E v^ll - ^^G, hZk + (max Zfc)co V VP^|| (/3 - ^)g, II2 



i=l 



fees 



fees 



< 5{co + l)(max Zk + y^pii; max Z^). 

l<fe<g * 2<fc<g 



Thus, by the Cauchy-Schwarz inequality. 



E, 



exp <^2s sup 



< <^E, 



i=l 



< E, 



exp < 2sS{co + 1)( max Z^ + JpsH max Zk] 

l<fe<q' 2<fe<5 



exp < 4s(5(co + 1) max 
i<fe<g 



1/2 



E. 



1/2 



exp <j 4s(5(co + l)^^iI7 max Z^ 

1/2 



1/2 



< <^ J] E, [exp {4s5(co + 1)ZJ] \ ■{Y.^e [exp {4s5(co + 1)^/^117^4] 



, fe=i 



fe=2 



Put a :— 4s5(co + 1). By Corollary A.l, on Q,q, we have 

E£[exp(aZjk)] < 16exp{1.5a-\/2n + 9a^n/pfe}, 
where we have used the fact that on Qq, 

n 

E[^fe'] =Pfe'Ell^^Gj|^ =Pfe'ntrE, < (1.5)V^Hr J^, = (1.5)V 

and 



sup V(a'a;iGfc/VPfc)^ = ^^ll^fell < (l-5)^Pfe ^ri. 



1=1 
Similarly, on Qq, we have [exp{a^pg_^Zk}] < 16 exp{1.5ay'2p5_^n + 9a^'nps_^/pk}- 
Therefore, on fig, the right side on (A. 2) is bounded by 



16gexp{0.75a(l + ^/p^)V2n + A.5a^n{l + ps_Jpmm) - 



= 16?exp{3s5(co + 1)(1 + VPi::)V^ + s'{6V^(^(co + l)^n{l^p-s_Jp^,^)Y - si/4} 
= 16?exp{-(i - 6)s/4 + cs^}, 

where h := 125(co + 1)(1 + ,/PsZ)^ ^nd c := {6v^5(co + l)^n(l + ps_Jpmin)V. 
Minimizing the right side with respect to s > 0, on we have 



Pe { sup 



>^j><16gexp{-(i-6)V(64c)}. 



4 

Therefore, we obtain the desired conclusion. □ 

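It remains to estimate $P(\Lambda > c_1^{-1} \lambda)$. Recall that $z_1^n := \{ z_1, \dots, z_n \}$.

Lemma A.4. For any $t_1 > 0$ and $t_2 > 0$,

\[ P\{ \Lambda > (4\sqrt{2} + \Delta)\sqrt{n} + t_1 + t_2 \mid z_1^n \} \leq 2 \exp\{ -t_1^2/(2n) \} + 16(q - 1) \exp\{ -p_{\min} t_2^2/(128n) \}. \]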

Proof. Put Xik := 'Sj^^^'^Xic^/ y/Pk- Observe that 



A < max II - P{yi < Xi(3\xiGj}xik\\2 



l<k<q ^ 
i=l 

n 

l<k<q '—^ 
i=\ 

=: Ai + As. 
We first analyze Ai. For 2 <k < q,we have 

n n 

- PiVi < x'i$\xiGk)}xik\\l = sup I J^{t - P{yi < x[$\xiGk)}oc'xik 

i=l aG§P^-l 



2 



<J2^r-P{yi<x[$\xiG,)V sup ^(a'i,,)' 

< ri^Cjaypk, 

where the second inequality is due to the Cauchy-Schwarz inequality. For A; = 1, by 
condition (C4) and Taylor's theorem, 

I X^{t - P{y^ < x',P\x,G,)}x^GA < n\T - P{y^ < x'M < ^a|, 

i=l 
which impUes that Ai < nLf 0^/2 V nC ja^/ ^Jp^ = y/n/S. 

It remains to estimate A2. Put 77^^ := P(yj < x'^^\XiG^) — I{yi < ^i^)- Observe 
that for > and ^2 > 0, 



P(A2 > h+h I < P 



n 
i=l 



> U 



z" > + (g-l) max P 

^ ' ' 2<fe<g 



> u 



Since $|\eta_{i1}| \leq 1$, by Hoeffding's inequality, the first term on the right side is bounded
by $2 \exp\{ -t_1^2/(2n) \}$. It remains to estimate the second term. Fix $2 \leq k \leq q$. Recall
that for $x \in \mathbb{R}^{p_k}$, $\| x \|_2 = \sup_{\alpha \in \mathbb{S}^{p_k - 1}} \alpha' x$. Let $\epsilon_1, \dots, \epsilon_n$ be independent Rademacher
random variables independent of $z_1^n$. By using the symmetrization inequality (van der
Vaart and Wellner, 1996, Lemma 2.3.7), for $t_2 \geq \sqrt{8n/p_k}$,



E 

i=l 



Tjik'^ik 



< 4E 



i=l 



> 



t2 



Since $|\eta_{ik}| \leq 1$, by the contraction theorem for Rademacher processes (Ledoux and
Talagrand, 1991, Theorem 4.4), the probability on the right side is bounded by

t2 



2P. 



1=1 



> 



The desired result now follows from the concentration inequality for Rademacher
processes (see Theorem A.1).

Proof of Theorem 3.1. Take A = ciAa, 

6 



□ 



Cf<t> 



2 

min 



8(p 



2 

max 



n 



+ 



n 



V 



6C 



Since 6 G [r^,,r*), by Lemma A. 2, we have 



P[{||/3 - /3||2 > 5} n {A > ciA} n no] < P{^/n sup |G„m;3| > t}. 



The present choice of t ensures that t > ^max^vSn and 



t/S - C2(l + y/p§Z;)Vn > By/log q ■ AC2^ (1 + Ps_JPmin)n, 
which implies by Lemma A. 3 that 

F{^/n sup |G„m^| > t} < 64g^"-^' + 47. 

Thus, we have 

P{!l/3 - $\\2 > S} < P(A > cr'A) + 64g^-^' + 57, 

where by Lemma A.4, P(A > q U) = P(A > A^) < 2e-^?/2 ^ ^Gg^-^i/^^^ Therefore, 
we obtain the desired conclusion. □ 
B Proofs of Lemmas 3.2 and 3.3

Proof of Lemma 3.2. It suffices to show that $\max_{2 \leq k \leq q} \| \hat{\Sigma}_k - I_{p_k} \| \stackrel{p}{\to} 0$. Fix $2 \leq k \leq q$
for a while. Define $h_\alpha(x) := (\alpha' x_{G_k})^2$ for $x \in \mathbb{R}^p$ and $\alpha \in \mathbb{S}^{p_k - 1}$. Observe that

1 



|Sfc-/pJ| = - sup 



5^{/i«(a;0-E[/ia(a;i)]} 



By Bousquet's (2002) version of Talagrand's (1996) inequality, for all $t > 0$,



Vlz> E[Z] + tK^2{npk + 4pfeE[Z]) + 



where we have used the fact that \ha{xi)\ < K'^Pk and ya,ic{ha{xi)) < E[(a'a;iGj.)^] < 
K'^PkE[{cx.'xiG^.y] — K'^Pk- We wish to estimate E[Z]. By Theorem 1 of Rudelson 
(1999), there exists a universal constant L such that 

E[Z] < LVr^log(p,Ve)(E[||xlGJ|2°"1)'/'°^^ 

provided that the last expression is smaller than $n$ (an explicit value of the constant $L$
and an elementary proof of Rudelson's inequality are given in Oliveira (2010)). Since



II^CiGfclb — K^/Pk, the last expression is bounded by fCLA/npmax log(pmax V e) =: f/„. 
The present hypothesis ensures that Un is of order o{n) and hence there exists a positive 
integer ni (independent of k) such that f/„ is smaller than n for all n>ni. Let n>ni. 
Then, for any i > 0, with probability at least 1 — e~*^, 

Z<Un + tK^2p^,^{n + Wn) + =: Un(t). 

Take t — •\/21og(g V n). Then, with probability at least 1 — g"^ A we have 
< Unit)- Since, by the present hypothesis that Pmaxlog(g y n)/n — )■ 0, tJnif) is of 
order o{n) and independent of /c, this implies that, by the union bound, max2<fc<5 — 

j,j|4o. □ 

Proof of Lemma 3.3. The proof is based on Corollary 2.7 of Mendelson et al. (2007). 
Define the set 

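\[ \mathcal{A} := \bigcup_{k=2}^{q} \{ \Sigma^{1/2}\alpha/\| \Sigma^{1/2}\alpha \|_2 : \alpha \in \mathbb{S}^{p-1}, \ \alpha_{G_l} = 0, \ \forall l \neq k \} \subset \mathbb{S}^{p-1}. \]

Let $L(\cdot)$ denote an isonormal Gaussian process on $\mathbb{R}^p$ with respect to the Euclidean
norm (see Dudley, 1999, Chapter 1 for basic materials on isonormal Gaussian
processes). Write $|L(\mathcal{A})| := \sup_{\alpha \in \mathcal{A}} |L(\alpha)|$. Pick any $\epsilon > 0$. In what follows, $c$ and
$C$ denote some constants independent of $n$ and $\epsilon$. Their values may change from
line to line. Observe that $\| \Sigma^{1/2}\alpha \|_2 = \| \Sigma_k^{1/2}\alpha_{G_k} \|_2 = \| \alpha_{G_k} \|_2 = 1$ for $\alpha \in \mathbb{S}^{p-1}$ such
that $\alpha_{G_l} = 0$ for all $l \neq k$. By Corollary 2.7 of Mendelson et al. (2007), as long as
$n \geq C \mathrm{E}[|L(\mathcal{A})|]^2/\epsilon^2$, with probability at least $1 - \exp(-c\epsilon^2 n)$,

\[ 1 - \epsilon \leq \frac{1}{n} \sum_{i=1}^{n} (\alpha' \tilde{x}_i)^2 \leq 1 + \epsilon, \quad \forall \alpha \in \mathcal{A}, \]

which implies that

\[ 1 - \epsilon \leq \| \hat{\Sigma}_k^{1/2}\alpha \|_2^2 \leq 1 + \epsilon, \quad \forall \alpha \in \mathbb{S}^{p_k - 1}, \ 2 \leq \forall k \leq q. \tag{B.1} \]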

We wish to estimate E[|L(^)|]. Define Uk := {a G : ||«||2 < 1, ac, = 0, VZ A;} 
for /c = 2, Recall that for a e Uk d ||S^/^q;||2 = HaaJb = 1, so that 

A C {S^/^a : a. e IJfc=2^fc}- ^ version of L(-) is given by L{cx) = ot'v where 

V ~ A'"(0, Ip). Thus, we have 

E[|L(^)|]<E[ sup |L(SV2a)|] 
= E[ sup |a'(S^/2^)|]. 

For each A;, it is shown by a standard argument that there exists a 1/2-cover 11^ of Uk 
such that rife C Wfe, Uk a 2convnfe and |nfe| < S*** (convllfe stands for the convex hull 
of rife, and 2convnfe = {2a : a e convllfe}; for this result, cf. Mendelson et al., 2008). 
Then, n := UL2 is a 1/2-cover of [jl^^Uk such that H C 0^=2^^' 111=2^^ ^ 
2 conv n and |n| < (g'— 1)5^™''''. By a maximal inequality for Gaussian random variables 
(cf. Ledoux and Talagrand, 1991, Eq. (3.13)), we have 

E[ sup |a'(sV2^)|] <E[ \a.\Y}/^v)\] 

<^<^^l=2^k a:e2convn 

= 2E[max|a'(S^/2^)|] 
max V log q max 

< C'VPmax Vlogg. 

Therefore, by the present hypothesis that (pmax V log q)/n — > 0, the inequality (B.l) 
holds with probability approaching one. This implies the desired conclusion. □ 

C Proof of Theorem 3.2 

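The second assertion follows from the first assertion and a careful examination of the
proof of Theorem 3.1. Thus, we concentrate on showing the first assertion. Take
$A_1 = \sqrt{2\{ t^2 \log(e \vee q^{1/p_{\min}}) + \log 4 \}}$ and $A_2 = \sqrt{128[\{ t^2 \log(e \vee q^{1/p_{\min}}) + \log 32 \}/\log q + 1]}$
so that $2e^{-A_1^2/2} = (e \vee q^{1/p_{\min}})^{-t^2}/2$ and $16 q^{1 - A_2^2/128} = (e \vee q^{1/p_{\min}})^{-t^2}/2$. Since
$2e^{-A_1^2/2} + 16 q^{1 - A_2^2/128} = (e \vee q^{1/p_{\min}})^{-t^2}/2 + (e \vee q^{1/p_{\min}})^{-t^2}/2 = (e \vee q^{1/p_{\min}})^{-t^2} = \theta$, by Lemma
A.4, we have

\[ \tilde{\Lambda}(1 - \theta | z_1^n) \leq (4\sqrt{2} + A_1)\sqrt{n} + A_2 \sqrt{n \log q/p_{\min}} \lesssim t \sqrt{n} (1 + \sqrt{\log q/p_{\min}}). \]

We wish to establish a lower bound on $\tilde{\Lambda}(1 - \theta | z_1^n)$. By definition,

\[ \tilde{\Lambda} \geq \Big| \sum_{i=1}^{n} \{ \tau - I(u_i \leq 0) \} \Big|. \]

We use the minorization inequality of Stout (1974, Theorem 5.2.2): let $\{ X_i, i \geq 1 \}$
denote a sequence of independent random variables with zero mean and finite variance,
and let $S_n = \sum_{i=1}^{n} X_i$ and $s_n^2 = \sum_{i=1}^{n} \mathrm{E}[X_i^2]$. Suppose that there exists a positive
constant $c$ such that $|X_i| \leq c s_n$ for all $1 \leq i \leq n$. Then, for any $\zeta > 0$, there exist
positive constants $\epsilon(\zeta)$ and $\pi(\zeta)$ such that if $\epsilon \geq \epsilon(\zeta)$ and $\epsilon c \leq \pi(\zeta)$, then

\[ P(S_n/s_n \geq \epsilon) \geq \exp\{ -(\epsilon^2/2)(1 + \zeta) \}. \]

Take $X_i = \tau - I(u_i \leq 0)$ and $\zeta = 1$. Then, $s_n^2 = n\tau(1 - \tau)$, and since $|X_i| \leq 1$,
$c = 1/s_n \lesssim n^{-1/2}$. Let $\epsilon = t \sqrt{\log(e \vee q^{1/p_{\min}})}$. Since $\epsilon$ diverges, $\epsilon \geq \epsilon(1)$ for large
$n$, and $\epsilon c \lesssim n^{-1/2} t \sqrt{\log(e \vee q^{1/p_{\min}})} = o(1)$, which ensures that $\epsilon c \leq \pi(1)$ for large $n$.
Therefore, for large $n$,

\[ P\Big\{ \tilde{\Lambda} \geq t \sqrt{\tau(1 - \tau) n \log(e \vee q^{1/p_{\min}})} \ \Big| \ z_1^n \Big\} \geq P(S_n/s_n \geq \epsilon) \geq \exp(-\epsilon^2) = (e \vee q^{1/p_{\min}})^{-t^2}. \]

Thus, we have

\[ \tilde{\Lambda}(1 - \theta | z_1^n) \geq t \sqrt{\tau(1 - \tau) n \log(e \vee q^{1/p_{\min}})}. \]

Therefore, we obtain the desired conclusion.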



□ 



D Proofs of Theorem 4.1 and Lemma 4.1 

Recall that $g_\beta(z) := \beta_0 + \sum_{k=1}^{d} \sum_{j=1}^{m} \beta_{kj} \psi_{kj}(z_k)$ for $\beta = (\beta_0, \beta_{11}, \dots, \beta_{1m}, \beta_{21}, \dots, \beta_{dm})'$.

Proof of Theorem 4-1- In view of the discussion on condition (C3), under condition 
(D7), it does not lose any generality to assume that = 1^ for all k > 2. For 
$k = d_1 + 1, \dots, d$, let $(\beta^0_{k1}, \dots, \beta^0_{km})' := 0_m$. Recall condition (D6). Define $\beta^0 := 
(\beta^0_0, \beta^0_{11}, \dots, \beta^0_{1m}, \beta^0_{21}, \dots, \beta^0_{dm})'$. By condition (D6) and the fact that $d_1$ is fixed, we have 
$a_{\beta^0} := \sup_z |g(z) - g_{\beta^0}(z)| = O(m^{-\nu})$. By Lemma 3.1, there exists a vector $\beta \in \mathbb{R}^p$ 
such that $E[f(0 \mid z_1)\{g(z_1) - g_\beta(z_1)\}] = 0$, $\beta_{G_k} = 0$ for all $k = d_1 + 2, \dots, q = d + 1$ 
and $a_\beta \lesssim a_{\beta^0}$. 

By Theorem 3.2, it suffices to check the conditions of Corollary 3.1 (i) with p2 — 
••• — Pk — Tn,q — l + d,p = 1 + dm and = 1 + di. It is not difficult to see 
that conditions (C1)-(C6) are satisfied. To see that condition (C7) is satisfied, recall 
Lemma 3.2. By condition (D5), condition (C7) is satisfied if $m\log(d \vee n)/n \to 0$, but 
this is ensured by condition (D9). To see that condition (C8) is satisfied, recall that 

— 0{m~^). On the other hand, 

^ A ^ A ^ f 1 + = A A - A ^ + (l + Mi±^\ > ^-2u/{2u+i) 

i/n n n \ pmin / \Ai n n \ m 

Thus, condition (C8) is also satisfied. Therefore, we have 



\\0 - /3|b tJl±^ (l + Mi±i^l) X V (D.l) 

provided that the right side is of order $o(r^*)$. By the discussion following Corollary 
3.1, it is seen that the last condition is satisfied if $t(n^{-\nu/(2\nu+1)} \vee \sqrt{\log d/n}) \to 0$ 
faster than $m^{-1/2} \asymp n^{-1/\{2(2\nu+1)\}}$, which is satisfied by the present hypothesis that 
$t^2\{n^{(1-2\nu)/(2\nu+1)} \vee (m\log d/n)\} \to 0$. Thus, (D.1) holds. In view of the proof of Theorem 
3.1, it is not hard to see that 



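$$\|\hat g - g_\beta\|_{L_2} = \|g_{\hat\beta} - g_\beta\|_{L_2} \leq \phi_{\max}^{1/2}\|\hat\beta - \beta\|_2 \lesssim \|\hat\beta - \beta\|_2 \lesssim_p t(n^{-\nu/(2\nu+1)} \vee \sqrt{\log d/n}).$$

The desired conclusion now follows from the relation: 

$$\|\hat g - g\|_{L_2} \leq \|\hat g - g_\beta\|_{L_2} + \|g_\beta - g\|_{L_2} \lesssim_p t(n^{-\nu/(2\nu+1)} \vee \sqrt{\log d/n}) + m^{-\nu}.$$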



□ 
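
The rate comparison in the proof above is an exponent calculation: requiring $t(n^{-\nu/(2\nu+1)} \vee \sqrt{\log d/n})$ to vanish faster than $m^{-1/2}$ is the same as requiring $t^2\{n^{(1-2\nu)/(2\nu+1)} \vee (m\log d/n)\} \to 0$. The sketch below checks this identity numerically under the assumption that $m = n^{1/(2\nu+1)}$ exactly (the proof only needs this up to constants); the function names and sample values are illustrative.

```python
import math

def rate(n, d, nu, t):
    """t * max(n^(-nu/(2nu+1)), sqrt(log d / n)), the rate appearing in the proof."""
    return t * max(n ** (-nu / (2 * nu + 1)), math.sqrt(math.log(d) / n))

def hypothesis_term(n, d, nu, t):
    """t^2 * max(n^((1-2nu)/(2nu+1)), m log d / n), assuming m = n^(1/(2nu+1))."""
    m = n ** (1 / (2 * nu + 1))
    return t**2 * max(n ** ((1 - 2 * nu) / (2 * nu + 1)), m * math.log(d) / n)

def squared_ratio(n, d, nu, t):
    """(rate / m^(-1/2))^2, which should coincide with hypothesis_term."""
    m = n ** (1 / (2 * nu + 1))
    return (rate(n, d, nu, t) / m ** -0.5) ** 2

# Illustrative values: smoothness nu = 2, d = 1000 components, t = 1.
for n in (10**4, 10**6):
    print(n, squared_ratio(n, 1000, 2.0, 1.0), hypothesis_term(n, 1000, 2.0, 1.0))
```

The two printed columns agree for each $n$, and both shrink as $n$ grows when $d$ and $t$ are held fixed.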



Proof of Lemma 4.1. Invoke that by (4.2) $\Sigma = \operatorname{diag}(1, *)$, so that for $\beta = (\beta_{G_1}, \beta_{G_2}', \dots, \beta_{G_q}')'$ $(q = d + 1)$, 

k=2 

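Thus, by the present hypothesis, there exist positive constants $c$ and $C$ ($c \leq C$) 
independent of $n$ such that for all $\beta = (\beta_{G_1}, \beta_{G_2}', \dots, \beta_{G_q}')' \in \mathbb{R}^{1+dm}$, 

$$(1 \wedge c)\|\beta\|_2^2 \leq \beta_{G_1}^2 + c\sum_{k=2}^{q}\|\beta_{G_k}\|_2^2 \leq \beta'\Sigma\beta \leq \beta_{G_1}^2 + C\sum_{k=2}^{q}\|\beta_{G_k}\|_2^2 \leq (1 \vee C)\|\beta\|_2^2.$$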

This implies the desired conclusion. □ 
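
The sandwich inequality in this proof can be checked on a toy example. The sketch below builds a hypothetical $3 \times 3$ matrix $\Sigma = \operatorname{diag}(1, B)$, where $B$ is a symmetric $2 \times 2$ block whose eigenvalues play the roles of $c$ and $C$, and verifies $(1 \wedge c)\|\beta\|_2^2 \leq \beta'\Sigma\beta \leq (1 \vee C)\|\beta\|_2^2$ on random vectors. The matrix entries and all names are assumptions made for illustration, not quantities from the paper.

```python
import math
import random

def quad_form(sigma, beta):
    """Return beta' sigma beta for a square matrix given as nested lists."""
    size = len(beta)
    return sum(beta[i] * sigma[i][j] * beta[j] for i in range(size) for j in range(size))

def sym2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]] in closed form."""
    mean = 0.5 * (a + d)
    radius = math.sqrt(0.25 * (a - d) ** 2 + b ** 2)
    return mean - radius, mean + radius

# Hypothetical lower block B; its extreme eigenvalues act as c and C.
B = (1.0, 0.3, 0.8)
c, C = sym2_eigenvalues(*B)
SIGMA = [[1.0, 0.0, 0.0],
         [0.0, B[0], B[1]],
         [0.0, B[1], B[2]]]

def bounds_hold(beta, tol=1e-9):
    """Check (1 ^ c)|beta|^2 <= beta' Sigma beta <= (1 v C)|beta|^2 up to tol."""
    norm2 = sum(x * x for x in beta)
    value = quad_form(SIGMA, beta)
    return min(1.0, c) * norm2 - tol <= value <= max(1.0, C) * norm2 + tol

random.seed(0)
print(all(bounds_hold([random.uniform(-5, 5) for _ in range(3)]) for _ in range(1000)))
```

Because the first coordinate enters with weight exactly 1, the overall bounds are $1 \wedge c$ and $1 \vee C$ rather than $c$ and $C$ themselves.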
Table 1: Simulation results for Model 1 

          τ = 0.25                   τ = 0.5                    τ = 0.75
          NSG      NSV      RMSE     NSG      NSV      RMSE     NSG      NSV      RMSE

Case 1
GRLasso   2.01     6.07     0.507    2.01     6.06     0.440    2.01     6.03     0.500
          (0.11)   (0.57)   (0.102)  (0.11)   (0.54)   (0.081)  (0.07)   (0.35)   (0.096)
Lasso     2.02     6.01     1.028    2.02     6.02     0.803    2.02     6.01     1.029
          (0.15)   (0.18)   (0.233)  (0.15)   (0.15)   (0.152)  (0.12)   (0.16)   (0.229)
QR        101.00   501.00   1.984    101.00   501.00   1.936    101.00   501.00   1.985
          (0.00)   (0.08)   (0.057)  (0.00)   (0.05)   (0.056)  (0.00)   (0.05)   (0.056)

Case 2
GRLasso   3.44     13.21    2.251    4.72     19.60    1.929    3.36     12.77    2.261
          (1.13)   (5.66)   (0.162)  (1.05)   (5.27)   (0.184)  (1.15)   (5.72)   (0.151)
Lasso     5.00     5.00     1.894    5.84     5.85     1.428    4.93     4.94     1.901
          (1.09)   (1.09)   (0.323)  (0.49)   (0.51)   (0.280)  (1.10)   (1.10)   (0.314)
QR        100.00   501.00   2.115    100.00   500.99   2.070    100.00   501.00   2.114
          (0.00)   (0.05)   (0.049)  (0.00)   (0.08)   (0.048)  (0.00)   (0.06)   (0.048)

"GRLasso" refers to the group Lasso estimator, "Lasso" to the $\ell_1$-penalized 
quantile regression estimator, "QR" to the unpenalized quantile regression 
estimator, "NSG" to the number of selected groups (including $G_1$), "NSV" to 
the number of selected variables (among $x_{i1}, \dots, x_{ip}$), and "RMSE" to the 
root mean squared error $\sqrt{E[\|\hat\beta - \beta\|_2^2]}$. Standard deviations are given in parentheses. 
Table 2: Simulation results for Model 2 

          τ = 0.25          τ = 0.5           τ = 0.75
          NSV      RMSE     NSV      RMSE     NSV      RMSE

n = 400
GRLasso   3.13     0.383    3.17     0.314    3.14     0.351
          (0.35)   (0.062)  (0.43)   (0.044)  (0.37)   (0.050)
Lasso     3.14     0.424    3.15     0.357    3.12     0.377
          (0.38)   (0.048)  (0.39)   (0.030)  (0.34)   (0.036)
QR        100.00   1.611    100.00   1.584    100.00   1.614
          (0.00)   (0.151)  (0.00)   (0.156)  (0.00)   (0.151)

n = 800
GRLasso   3.16     0.241    3.19     0.211    3.18     0.232
          (0.39)   (0.032)  (0.44)   (0.026)  (0.43)   (0.030)
Lasso     3.12     0.323    3.14     0.300    3.14     0.308
          (0.34)   (0.020)  (0.38)   (0.014)  (0.37)   (0.016)
QR        100.00   2.224    100.00   2.100    100.00   2.229
          (0.00)   (0.197)  (0.00)   (0.182)  (0.00)   (0.191)

"GRLasso" refers to the group Lasso estimator, "Lasso" to the $\ell_1$-penalized 
quantile regression estimator, "QR" to the unpenalized quantile regression 
estimator, "NSV" to the number of selected variables (among $z_{i1}, \dots, z_{i,100}$), 
and "RMSE" to the root mean squared error $\sqrt{E[\|g_{\hat\beta} - g\|_{L_2}^2]}$. Standard deviations 
are given in parentheses. 